2023-06-17 16:37:24,884 INFO [train.py:1064] (0/4) Training started
2023-06-17 16:37:24,896 INFO [train.py:1074] (0/4) Device: cuda:0
2023-06-17 16:37:27,213 INFO [lexicon.py:168] (0/4) Loading pre-compiled data/lang_char/Linv.pt
2023-06-17 16:37:27,424 INFO [train.py:1085] (0/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'c51a0b9684442a88ee37f3ce0af686a04b66855b', 'k2-git-date': 'Mon May 1 21:38:03 2023', 'lhotse-version': '1.14.0.dev+git.0f812851.dirty', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'zipformer_wenetspeech', 'icefall-git-sha1': '802bf98-dirty', 'icefall-git-date': 'Fri Jun 16 18:26:55 2023', 'icefall-path': '/star-kw/kangwei/code/icefall_wenetspeech', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/dev_tools/anaconda3/envs/rnnt2/lib/python3.8/site-packages/lhotse-1.14.0.dev0+git.0f812851.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-7-1218101249-5d97868c7c-v8ngc', 'IP address': '10.177.77.18'}, 'world_size': 4, 'master_port': 12536, 'tensorboard': True, 'num_epochs': 12, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_L_small'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 900, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2023-06-17 16:37:27,424 INFO [train.py:1087] (0/4) About to create model
2023-06-17 16:37:28,000 INFO [train.py:1091] (0/4) Number of model parameters: 32327030
2023-06-17 16:37:34,460 INFO [train.py:1106] (0/4) Using DDP
2023-06-17 16:37:34,876 INFO [asr_datamodule.py:390] (0/4) About to get train cuts
2023-06-17 16:37:34,901 INFO [asr_datamodule.py:398] (0/4) About to get dev cuts
2023-06-17 16:37:34,903 INFO [asr_datamodule.py:211] (0/4) About to get Musan cuts
2023-06-17 16:37:38,010 INFO [asr_datamodule.py:216] (0/4) Enable MUSAN
2023-06-17 16:37:38,011 INFO [asr_datamodule.py:239] (0/4) Enable SpecAugment
2023-06-17 16:37:38,012 INFO [asr_datamodule.py:240] (0/4) Time warp factor: 80
2023-06-17 16:37:38,013 INFO [asr_datamodule.py:250] (0/4) Num frame mask: 10
2023-06-17 16:37:38,014 INFO [asr_datamodule.py:263] (0/4) About to create train dataset
2023-06-17 16:37:38,015 INFO [asr_datamodule.py:289] (0/4) Using DynamicBucketingSampler.
2023-06-17 16:37:41,832 INFO [asr_datamodule.py:305] (0/4) About to create train dataloader
2023-06-17 16:37:41,834 INFO [asr_datamodule.py:336] (0/4) About to create dev dataset
2023-06-17 16:37:42,543 INFO [asr_datamodule.py:354] (0/4) About to create dev dataloader
2023-06-17 16:39:51,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.31 vs. limit=5.0
2023-06-17 16:39:59,891 INFO [train.py:996] (0/4) Epoch 1, batch 0, loss[loss=10.41, simple_loss=9.46, pruned_loss=9.489, over 21767.00 frames. ], tot_loss[loss=10.41, simple_loss=9.46, pruned_loss=9.489, over 21767.00 frames. ], batch size: 102, lr: 2.25e-02, grad_scale: 1.0
2023-06-17 16:39:59,893 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-17 16:40:52,887 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=10.49, simple_loss=9.517, pruned_loss=9.679, over 1796401.00 frames.
2023-06-17 16:40:52,888 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 22296MB
2023-06-17 16:40:55,539 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=4.0
2023-06-17 16:41:03,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=7.5
2023-06-17 16:41:10,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=60.0, ans=0.4971875
2023-06-17 16:41:21,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=22.67 vs. limit=7.5225
2023-06-17 16:41:21,302 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=26.02 vs. limit=7.5225
2023-06-17 16:41:22,486 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=200.77 vs. limit=7.5225
2023-06-17 16:41:23,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=120.0, ans=0.494375
2023-06-17 16:41:37,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=118.75 vs. limit=5.06
2023-06-17 16:42:05,307 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=249.16 vs. limit=7.5675
2023-06-17 16:42:31,418 INFO [train.py:996] (0/4) Epoch 1, batch 50, loss[loss=0.9316, simple_loss=0.8313, pruned_loss=0.9017, over 16973.00 frames. ], tot_loss[loss=4.125, simple_loss=3.816, pruned_loss=3.043, over 959557.71 frames. ], batch size: 62, lr: 2.48e-02, grad_scale: 0.5
2023-06-17 16:43:02,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=300.0, ans=0.8895000000000001
2023-06-17 16:43:02,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=105.46 vs. limit=7.6125
2023-06-17 16:43:09,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=7.06 vs. limit=3.054
2023-06-17 16:43:45,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=242.38 vs. limit=7.6575
2023-06-17 16:44:34,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.66 vs. limit=5.135
2023-06-17 16:44:45,953 INFO [train.py:996] (0/4) Epoch 1, batch 100, loss[loss=1.192, simple_loss=1.034, pruned_loss=1.268, over 21290.00 frames. ], tot_loss[loss=2.591, simple_loss=2.365, pruned_loss=2.108, over 1694636.16 frames. ], batch size: 176, lr: 2.70e-02, grad_scale: 1.0
2023-06-17 16:44:49,126 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 2.605e+02 7.361e+02 5.108e+03 2.907e+04, threshold=1.472e+03, percent-clipped=0.0
2023-06-17 16:44:52,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=600.0, ans=0.471875
2023-06-17 16:44:53,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=50.82 vs. limit=7.725
2023-06-17 16:44:54,686 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.26 vs. limit=7.95
2023-06-17 16:44:55,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=600.0, ans=0.471875
2023-06-17 16:44:59,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=89.68 vs. limit=7.725
2023-06-17 16:45:01,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=14.28 vs. limit=4.264
2023-06-17 16:45:05,577 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=132.97 vs. limit=7.7475
2023-06-17 16:45:38,068 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=37.65 vs. limit=8.04
2023-06-17 16:45:39,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=720.0, ans=0.46625
2023-06-17 16:45:39,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=52.66 vs. limit=7.77
2023-06-17 16:45:41,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=720.0, ans=5.18
2023-06-17 16:45:42,753 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=245.47 vs. limit=7.77
2023-06-17 16:45:45,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=164.00 vs. limit=7.77
2023-06-17 16:45:48,288 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=37.95 vs. limit=8.04
2023-06-17 16:46:25,831 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=7.43 vs. limit=4.312
2023-06-17 16:46:28,733 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=38.33 vs. limit=7.7925
2023-06-17 16:46:30,352 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=4.336
2023-06-17 16:46:42,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=8.13
2023-06-17 16:46:42,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=840.0, ans=7.815
2023-06-17 16:46:53,218 INFO [train.py:996] (0/4) Epoch 1, batch 150, loss[loss=0.9554, simple_loss=0.8188, pruned_loss=0.9975, over 21146.00 frames. ], tot_loss[loss=1.998, simple_loss=1.799, pruned_loss=1.738, over 2272036.62 frames. ], batch size: 143, lr: 2.93e-02, grad_scale: 1.0
2023-06-17 16:46:57,396 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.87 vs. limit=8.175
2023-06-17 16:47:21,981 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.03 vs. limit=7.86
2023-06-17 16:47:53,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=34.90 vs. limit=8.265
2023-06-17 16:48:19,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1080.0, ans=0.44937499999999997
2023-06-17 16:48:37,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1080.0, ans=0.44937499999999997
2023-06-17 16:48:40,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1140.0, ans=0.092875
2023-06-17 16:48:54,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=4.456
2023-06-17 16:48:58,451 INFO [train.py:996] (0/4) Epoch 1, batch 200, loss[loss=1.094, simple_loss=0.9438, pruned_loss=1.04, over 21537.00 frames. ], tot_loss[loss=1.671, simple_loss=1.489, pruned_loss=1.5, over 2716027.32 frames. ], batch size: 414, lr: 3.15e-02, grad_scale: 2.0
2023-06-17 16:49:01,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.811e+01 1.173e+02 1.419e+02 1.881e+02 2.743e+02, threshold=2.839e+02, percent-clipped=0.0
2023-06-17 16:49:07,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=5.3
2023-06-17 16:49:35,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=4.504
2023-06-17 16:50:10,800 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=63.38 vs. limit=7.995
2023-06-17 16:50:49,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=52.15 vs. limit=8.0175
2023-06-17 16:50:59,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=8.535
2023-06-17 16:51:12,735 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=22.26 vs. limit=8.04
2023-06-17 16:51:18,150 INFO [train.py:996] (0/4) Epoch 1, batch 250, loss[loss=0.9734, simple_loss=0.8305, pruned_loss=0.9161, over 21611.00 frames. ], tot_loss[loss=1.469, simple_loss=1.299, pruned_loss=1.333, over 3061799.94 frames. ], batch size: 230, lr: 3.38e-02, grad_scale: 2.0
2023-06-17 16:51:27,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=4.6
2023-06-17 16:51:33,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=5.39
2023-06-17 16:51:49,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1560.0, ans=0.8454
2023-06-17 16:52:17,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.53 vs. limit=5.8100000000000005
2023-06-17 16:52:19,742 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.87 vs. limit=5.8100000000000005
2023-06-17 16:52:21,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=17.74 vs. limit=8.1075
2023-06-17 16:52:24,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=8.715
2023-06-17 16:53:14,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1740.0, ans=8.1525
2023-06-17 16:53:22,240 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.42 vs. limit=5.87
2023-06-17 16:53:22,392 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=37.02 vs. limit=8.1525
2023-06-17 16:53:25,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=21.85 vs. limit=8.1525
2023-06-17 16:53:28,116 INFO [train.py:996] (0/4) Epoch 1, batch 300, loss[loss=0.85, simple_loss=0.7202, pruned_loss=0.7815, over 21833.00 frames. ], tot_loss[loss=1.319, simple_loss=1.158, pruned_loss=1.2, over 3333105.90 frames. ], batch size: 107, lr: 3.60e-02, grad_scale: 4.0
2023-06-17 16:53:31,432 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 7.248e+01 1.102e+02 1.349e+02 1.694e+02 3.595e+02, threshold=2.697e+02, percent-clipped=2.0
2023-06-17 16:53:37,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=16.89 vs. limit=8.175
2023-06-17 16:53:38,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=17.23 vs. limit=8.175
2023-06-17 16:53:39,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.95 vs. limit=8.85
2023-06-17 16:53:45,052 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=31.88 vs. limit=8.1975
2023-06-17 16:53:45,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.72 vs. limit=5.93
2023-06-17 16:53:48,562 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=8.895
2023-06-17 16:54:41,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=8.94
2023-06-17 16:55:15,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1980.0, ans=0.8307
2023-06-17 16:55:37,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.52 vs. limit=6.05
2023-06-17 16:55:38,326 INFO [train.py:996] (0/4) Epoch 1, batch 350, loss[loss=0.8147, simple_loss=0.687, pruned_loss=0.729, over 21347.00 frames. ], tot_loss[loss=1.207, simple_loss=1.053, pruned_loss=1.095, over 3549084.39 frames. ], batch size: 131, lr: 3.83e-02, grad_scale: 4.0
2023-06-17 16:55:48,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2100.0, ans=6.3125
2023-06-17 16:56:20,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=8.31
2023-06-17 16:56:34,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=9.165
2023-06-17 16:56:35,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2220.0, ans=0.050050000000000004
2023-06-17 16:56:37,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=9.165
2023-06-17 16:56:53,189 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.75 vs. limit=6.11
2023-06-17 16:57:10,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=45.09 vs. limit=8.355
2023-06-17 16:57:13,137 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=24.35 vs. limit=8.355
2023-06-17 16:57:14,760 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=20.05 vs. limit=8.355
2023-06-17 16:57:23,307 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=34.27 vs. limit=8.355
2023-06-17 16:57:27,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.89 vs. limit=6.14
2023-06-17 16:57:33,187 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=3.351
2023-06-17 16:57:37,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2340.0, ans=0.3903125
2023-06-17 16:57:44,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=8.3775
2023-06-17 16:57:46,853 INFO [train.py:996] (0/4) Epoch 1, batch 400, loss[loss=0.799, simple_loss=0.6709, pruned_loss=0.6966, over 21183.00 frames. ], tot_loss[loss=1.12, simple_loss=0.9701, pruned_loss=1.009, over 3707120.31 frames. ], batch size: 143, lr: 4.05e-02, grad_scale: 8.0
2023-06-17 16:57:50,151 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 9.350e+01 1.224e+02 1.536e+02 2.025e+02 4.442e+02, threshold=3.072e+02, percent-clipped=8.0
2023-06-17 16:57:56,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=8.4
2023-06-17 16:57:57,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2400.0, ans=0.11
2023-06-17 16:58:03,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2460.0, ans=0.0423125
2023-06-17 16:58:06,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2460.0, ans=0.0423125
2023-06-17 16:59:08,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2580.0, ans=0.8097000000000001
2023-06-17 16:59:18,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=28.19 vs. limit=8.4675
2023-06-17 16:59:50,786 INFO [train.py:996] (0/4) Epoch 1, batch 450, loss[loss=0.7937, simple_loss=0.666, pruned_loss=0.6694, over 21185.00 frames. ], tot_loss[loss=1.062, simple_loss=0.9142, pruned_loss=0.9466, over 3841884.48 frames. ], batch size: 177, lr: 4.28e-02, grad_scale: 8.0
2023-06-17 17:00:01,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=78.94 vs. limit=8.5125
2023-06-17 17:00:03,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.14 vs. limit=6.35
2023-06-17 17:00:04,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.65 vs. limit=5.675
2023-06-17 17:00:36,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.56 vs. limit=5.705
2023-06-17 17:01:01,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2820.0, ans=0.24230000000000002
2023-06-17 17:01:08,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.83 vs. limit=5.72
2023-06-17 17:01:20,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2880.0, ans=0.092
2023-06-17 17:01:23,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2880.0, ans=0.365
2023-06-17 17:01:45,399 INFO [train.py:996] (0/4) Epoch 1, batch 500, loss[loss=0.9047, simple_loss=0.7639, pruned_loss=0.7264, over 19883.00 frames. ], tot_loss[loss=1.041, simple_loss=0.8909, pruned_loss=0.9125, over 3937347.65 frames. ], batch size: 703, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:01:45,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3000.0, ans=0.27
2023-06-17 17:01:48,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 9.420e+01 1.754e+02 2.624e+02 3.522e+02 8.349e+02, threshold=5.248e+02, percent-clipped=35.0
2023-06-17 17:02:46,220 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=8.6475
2023-06-17 17:02:58,159 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=8.67
2023-06-17 17:03:28,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=9.885
2023-06-17 17:03:29,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=3180.0, ans=0.02844999999999999
2023-06-17 17:03:32,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=3180.0, ans=0.7887000000000001
2023-06-17 17:03:39,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.43 vs. limit=5.795
2023-06-17 17:03:59,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=5.32
2023-06-17 17:04:00,014 INFO [train.py:996] (0/4) Epoch 1, batch 550, loss[loss=0.9717, simple_loss=0.8265, pruned_loss=0.7444, over 21621.00 frames. ], tot_loss[loss=1.004, simple_loss=0.8576, pruned_loss=0.8619, over 4017338.33 frames. ], batch size: 441, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:04:52,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=3360.0, ans=0.07400000000000001
2023-06-17 17:04:58,052 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.26 vs. limit=6.71
2023-06-17 17:05:12,481 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=5.855
2023-06-17 17:05:14,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=5.368
2023-06-17 17:05:59,261 INFO [train.py:996] (0/4) Epoch 1, batch 600, loss[loss=0.6927, simple_loss=0.5952, pruned_loss=0.5046, over 22005.00 frames. ], tot_loss[loss=0.9687, simple_loss=0.827, pruned_loss=0.81, over 4080785.27 frames. ], batch size: 103, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:06:03,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 3.057e+02 4.199e+02 5.888e+02 1.512e+03, threshold=8.399e+02, percent-clipped=32.0
2023-06-17 17:06:03,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=3600.0, ans=0.33125
2023-06-17 17:06:04,240 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.40 vs. limit=6.8
2023-06-17 17:06:31,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.20 vs. limit=5.915
2023-06-17 17:06:33,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=8.8725
2023-06-17 17:06:34,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3660.0, ans=0.26339999999999997
2023-06-17 17:07:32,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=3780.0, ans=0.058249999999999996
2023-06-17 17:07:50,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=3840.0, ans=0.7656000000000001
2023-06-17 17:07:51,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=3840.0, ans=0.055999999999999994
2023-06-17 17:07:56,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.30 vs. limit=6.92
2023-06-17 17:08:07,408 INFO [train.py:996] (0/4) Epoch 1, batch 650, loss[loss=0.9003, simple_loss=0.7869, pruned_loss=0.6144, over 19746.00 frames. ], tot_loss[loss=0.9334, simple_loss=0.7983, pruned_loss=0.7582, over 4113183.44 frames. ], batch size: 703, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:09:44,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=4080.0, ans=0.7572
2023-06-17 17:09:52,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=4080.0, ans=0.7572
2023-06-17 17:09:54,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=4080.0, ans=0.04966666666666667
2023-06-17 17:10:13,703 INFO [train.py:996] (0/4) Epoch 1, batch 700, loss[loss=0.6954, simple_loss=0.6043, pruned_loss=0.4742, over 21851.00 frames. ], tot_loss[loss=0.8914, simple_loss=0.7642, pruned_loss=0.7037, over 4149803.13 frames. ], batch size: 118, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:10:16,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=10.65
2023-06-17 17:10:16,592 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 4.036e+02 7.786e+02 1.089e+03 2.394e+03, threshold=1.557e+03, percent-clipped=44.0
2023-06-17 17:10:38,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=4200.0, ans=0.303125
2023-06-17 17:10:50,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=4260.0, ans=0.009943478260869566
2023-06-17 17:11:54,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=9.1425
2023-06-17 17:11:59,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.83 vs. limit=10.785
2023-06-17 17:12:06,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4440.0, ans=0.2556
2023-06-17 17:12:22,076 INFO [train.py:996] (0/4) Epoch 1, batch 750, loss[loss=0.6119, simple_loss=0.5308, pruned_loss=0.4128, over 21854.00 frames. ], tot_loss[loss=0.8472, simple_loss=0.7285, pruned_loss=0.6502, over 4185740.40 frames. ], batch size: 98, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:12:41,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4500.0, ans=0.255
2023-06-17 17:13:07,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=10.92
2023-06-17 17:13:43,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=4620.0, ans=0.7383
2023-06-17 17:13:45,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=9.2325
2023-06-17 17:13:53,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.77 vs. limit=3.702
2023-06-17 17:14:17,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=4740.0, ans=0.04691666666666667
2023-06-17 17:14:30,924 INFO [train.py:996] (0/4) Epoch 1, batch 800, loss[loss=0.5813, simple_loss=0.5187, pruned_loss=0.3614, over 21703.00 frames. ], tot_loss[loss=0.8068, simple_loss=0.6963, pruned_loss=0.6021, over 4203821.23 frames. ], batch size: 124, lr: 4.49e-02, grad_scale: 16.0
2023-06-17 17:14:33,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 4.647e+02 7.147e+02 1.104e+03 3.003e+03, threshold=1.429e+03, percent-clipped=10.0
2023-06-17 17:14:49,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=9.3
2023-06-17 17:15:18,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=4860.0, ans=0.2721875
2023-06-17 17:15:30,125 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=9.345
2023-06-17 17:15:47,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.73 vs. limit=7.46
2023-06-17 17:16:16,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=11.28
2023-06-17 17:16:32,220 INFO [train.py:996] (0/4) Epoch 1, batch 850, loss[loss=0.5499, simple_loss=0.4852, pruned_loss=0.3476, over 21111.00 frames. ], tot_loss[loss=0.7671, simple_loss=0.6644, pruned_loss=0.5573, over 4226370.49 frames. ], batch size: 143, lr: 4.49e-02, grad_scale: 4.0
2023-06-17 17:18:49,067 INFO [train.py:996] (0/4) Epoch 1, batch 900, loss[loss=0.6749, simple_loss=0.5876, pruned_loss=0.4347, over 21730.00 frames. ], tot_loss[loss=0.7309, simple_loss=0.6361, pruned_loss=0.5163, over 4242784.35 frames. ], batch size: 389, lr: 4.48e-02, grad_scale: 8.0
2023-06-17 17:18:55,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 5.086e+02 7.748e+02 1.151e+03 3.891e+03, threshold=1.550e+03, percent-clipped=18.0
2023-06-17 17:19:04,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=5460.0, ans=0.04391666666666667
2023-06-17 17:20:56,019 INFO [train.py:996] (0/4) Epoch 1, batch 950, loss[loss=0.5867, simple_loss=0.5144, pruned_loss=0.3686, over 21543.00 frames. ], tot_loss[loss=0.7054, simple_loss=0.6164, pruned_loss=0.4855, over 4257040.97 frames. ], batch size: 548, lr: 4.48e-02, grad_scale: 4.0
2023-06-17 17:21:24,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.71 vs. limit=3.855
2023-06-17 17:21:33,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=5760.0, ans=0.04266666666666667
2023-06-17 17:21:34,412 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.20 vs. limit=9.66
2023-06-17 17:22:17,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=5820.0, ans=0.2271875
2023-06-17 17:22:18,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=11.865
2023-06-17 17:22:27,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=9.705
2023-06-17 17:22:33,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5880.0, ans=0.2412
2023-06-17 17:22:42,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=5940.0, ans=0.2215625
2023-06-17 17:22:52,028 INFO [train.py:996] (0/4) Epoch 1, batch 1000, loss[loss=0.5697, simple_loss=0.5177, pruned_loss=0.3297, over 21792.00 frames. ], tot_loss[loss=0.6819, simple_loss=0.5987, pruned_loss=0.4574, over 4260224.52 frames. ], batch size: 124, lr: 4.48e-02, grad_scale: 8.0
2023-06-17 17:22:53,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=12.0
2023-06-17 17:23:13,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.329e+02 4.662e+02 7.435e+02 1.273e+03 3.855e+03, threshold=1.487e+03, percent-clipped=17.0
2023-06-17 17:24:30,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=6180.0, ans=0.0
2023-06-17 17:24:31,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=9.817499999999999
2023-06-17 17:25:17,337 INFO [train.py:996] (0/4) Epoch 1, batch 1050, loss[loss=0.5328, simple_loss=0.479, pruned_loss=0.3135, over 21453.00 frames. ], tot_loss[loss=0.661, simple_loss=0.5833, pruned_loss=0.4326, over 4266564.07 frames. ], batch size: 212, lr: 4.48e-02, grad_scale: 8.0
2023-06-17 17:27:39,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6540.0, ans=0.23459999999999998
2023-06-17 17:27:45,204 INFO [train.py:996] (0/4) Epoch 1, batch 1100, loss[loss=0.6943, simple_loss=0.6278, pruned_loss=0.4019, over 21536.00 frames. ], tot_loss[loss=0.6378, simple_loss=0.566, pruned_loss=0.4076, over 4263862.08 frames. ], batch size: 471, lr: 4.48e-02, grad_scale: 8.0
2023-06-17 17:27:50,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=6600.0, ans=0.058750000000000004
2023-06-17 17:28:00,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 3.328e+02 5.726e+02 1.108e+03 4.215e+03, threshold=1.145e+03, percent-clipped=17.0
2023-06-17 17:28:05,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=6660.0, ans=0.1878125
2023-06-17 17:28:27,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=6720.0, ans=0.185
2023-06-17 17:28:30,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=6720.0, ans=0.6648000000000001
2023-06-17 17:28:46,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=6780.0, ans=0.1821875
2023-06-17 17:30:09,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=10.065
2023-06-17 17:30:10,957 INFO [train.py:996] (0/4) Epoch 1, batch 1150, loss[loss=0.5394, simple_loss=0.4952, pruned_loss=0.3022, over 21829.00 frames. ], tot_loss[loss=0.6168, simple_loss=0.5507, pruned_loss=0.3852, over 4260262.49 frames. ], batch size: 118, lr: 4.47e-02, grad_scale: 4.0
2023-06-17 17:30:41,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=6960.0, ans=0.6564
2023-06-17 17:30:43,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=6960.0, ans=0.17375000000000002
2023-06-17 17:30:43,550 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=10.11
2023-06-17 17:30:44,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=6960.0, ans=0.17375000000000002
2023-06-17 17:31:46,857 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-17 17:31:47,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.56 vs. limit=6.77
2023-06-17 17:32:17,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=7140.0, ans=10.1775
2023-06-17 17:32:20,711 INFO [train.py:996] (0/4) Epoch 1, batch 1200, loss[loss=0.5264, simple_loss=0.5027, pruned_loss=0.2727, over 21278.00 frames. ], tot_loss[loss=0.6062, simple_loss=0.5437, pruned_loss=0.3713, over 4269439.73 frames. ], batch size: 548, lr: 4.47e-02, grad_scale: 8.0
2023-06-17 17:32:29,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=7200.0, ans=0.648
2023-06-17 17:32:36,895 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.515e+02 4.923e+02 7.154e+02 1.207e+03 2.545e+03, threshold=1.431e+03, percent-clipped=26.0
2023-06-17 17:32:52,457 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=10.2225
2023-06-17 17:33:55,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.07 vs. limit=6.86
2023-06-17 17:34:15,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=7440.0, ans=0.15125
2023-06-17 17:34:27,343 INFO [train.py:996] (0/4) Epoch 1, batch 1250, loss[loss=0.5552, simple_loss=0.5097, pruned_loss=0.3091, over 21502.00 frames. ], tot_loss[loss=0.5968, simple_loss=0.538, pruned_loss=0.3584, over 4276594.19 frames. ], batch size: 548, lr: 4.47e-02, grad_scale: 8.0
2023-06-17 17:34:59,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=7560.0, ans=0.00922608695652174
2023-06-17 17:36:12,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7680.0, ans=0.2232
2023-06-17 17:36:39,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.86 vs. limit=6.9350000000000005
2023-06-17 17:36:45,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=10.425
2023-06-17 17:36:45,489 INFO [train.py:996] (0/4) Epoch 1, batch 1300, loss[loss=0.4894, simple_loss=0.435, pruned_loss=0.2858, over 20874.00 frames. ], tot_loss[loss=0.5872, simple_loss=0.5319, pruned_loss=0.3467, over 4283589.74 frames. ], batch size: 608, lr: 4.47e-02, grad_scale: 8.0
2023-06-17 17:36:50,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=7800.0, ans=0.034166666666666665
2023-06-17 17:37:02,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.523e+02 4.002e+02 7.251e+02 1.294e+03 4.242e+03, threshold=1.450e+03, percent-clipped=21.0
2023-06-17 17:38:00,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7980.0, ans=0.2202
2023-06-17 17:38:46,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=8040.0, ans=0.04949747468305833
2023-06-17 17:38:48,589 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.95 vs. limit=10.515
2023-06-17 17:38:54,703 INFO [train.py:996] (0/4) Epoch 1, batch 1350, loss[loss=0.5301, simple_loss=0.4941, pruned_loss=0.2866, over 21837.00 frames. ], tot_loss[loss=0.576, simple_loss=0.5249, pruned_loss=0.3342, over 4285518.39 frames. ], batch size: 332, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:39:02,900 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-17 17:39:04,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=8100.0, ans=0.03291666666666667
2023-06-17 17:39:34,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=8160.0, ans=0.03266666666666667
2023-06-17 17:39:42,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=8160.0, ans=0.009095652173913043
2023-06-17 17:40:23,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8280.0, ans=0.2172
2023-06-17 17:41:08,262 INFO [train.py:996] (0/4) Epoch 1, batch 1400, loss[loss=0.7058, simple_loss=0.6296, pruned_loss=0.4063, over 21571.00 frames. ], tot_loss[loss=0.5639, simple_loss=0.516, pruned_loss=0.3228, over 4281607.47 frames. ], batch size: 507, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:41:09,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=10.65
2023-06-17 17:41:24,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 4.733e+02 7.957e+02 1.163e+03 2.485e+03, threshold=1.591e+03, percent-clipped=13.0
2023-06-17 17:41:26,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.44 vs. limit=10.65
2023-06-17 17:41:59,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=8460.0, ans=0.03141666666666667
2023-06-17 17:42:01,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.95 vs. limit=7.13
2023-06-17 17:42:44,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=8580.0, ans=0.2
2023-06-17 17:43:12,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=8640.0, ans=0.008991304347826088
2023-06-17 17:43:23,578 INFO [train.py:996] (0/4) Epoch 1, batch 1450, loss[loss=0.5576, simple_loss=0.5164, pruned_loss=0.3035, over 21568.00 frames. ], tot_loss[loss=0.5572, simple_loss=0.5119, pruned_loss=0.315, over 4281464.15 frames. ], batch size: 230, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:43:24,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=8700.0, ans=0.030416666666666668
2023-06-17 17:44:37,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8820.0, ans=0.2118
2023-06-17 17:45:21,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=8940.0, ans=0.02941666666666667
2023-06-17 17:45:34,295 INFO [train.py:996] (0/4) Epoch 1, batch 1500, loss[loss=0.497, simple_loss=0.4784, pruned_loss=0.2555, over 21825.00 frames. ], tot_loss[loss=0.5533, simple_loss=0.5101, pruned_loss=0.3095, over 4284132.77 frames. ], batch size: 371, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:46:08,688 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.531e+02 4.868e+02 8.441e+02 1.240e+03 3.321e+03, threshold=1.688e+03, percent-clipped=12.0
2023-06-17 17:46:56,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=9180.0, ans=0.02841666666666667
2023-06-17 17:47:27,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=9180.0, ans=0.04949747468305833
2023-06-17 17:47:30,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=9180.0, ans=0.5787
2023-06-17 17:47:57,232 INFO [train.py:996] (0/4) Epoch 1, batch 1550, loss[loss=0.3988, simple_loss=0.4078, pruned_loss=0.1869, over 21173.00 frames. ], tot_loss[loss=0.5393, simple_loss=0.5007, pruned_loss=0.2975, over 4286213.35 frames. ], batch size: 143, lr: 4.45e-02, grad_scale: 8.0
2023-06-17 17:47:59,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=9300.0, ans=0.07
2023-06-17 17:48:50,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=9360.0, ans=0.008834782608695652
2023-06-17 17:49:27,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=9420.0, ans=0.125
2023-06-17 17:49:33,384 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-17 17:49:37,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=9480.0, ans=0.05
2023-06-17 17:50:25,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=9600.0, ans=0.125
2023-06-17 17:50:26,906 INFO [train.py:996] (0/4) Epoch 1, batch 1600, loss[loss=0.5012, simple_loss=0.4894, pruned_loss=0.2531, over 21704.00 frames. ], tot_loss[loss=0.529, simple_loss=0.494, pruned_loss=0.2884, over 4284368.94 frames. ], batch size: 263, lr: 4.45e-02, grad_scale: 16.0
2023-06-17 17:50:28,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=9600.0, ans=0.125
2023-06-17 17:50:36,900 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.850e+02 5.686e+02 1.025e+03 3.086e+03, threshold=1.137e+03, percent-clipped=9.0
2023-06-17 17:51:21,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=9660.0, ans=0.125
2023-06-17 17:51:44,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9720.0, ans=0.20279999999999998
2023-06-17 17:52:43,634 INFO [train.py:996] (0/4) Epoch 1, batch 1650, loss[loss=0.6457, simple_loss=0.5786, pruned_loss=0.3629, over 21449.00 frames. ], tot_loss[loss=0.5197, simple_loss=0.4885, pruned_loss=0.28, over 4284229.45 frames. ], batch size: 471, lr: 4.45e-02, grad_scale: 16.0
2023-06-17 17:54:30,437 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=11.28
2023-06-17 17:54:31,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=10080.0, ans=0.5472
2023-06-17 17:54:33,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=8.032
2023-06-17 17:55:02,456 INFO [train.py:996] (0/4) Epoch 1, batch 1700, loss[loss=0.495, simple_loss=0.4849, pruned_loss=0.2499, over 21604.00 frames. ], tot_loss[loss=0.519, simple_loss=0.4903, pruned_loss=0.2772, over 4284449.12 frames. ], batch size: 414, lr: 4.44e-02, grad_scale: 8.0
2023-06-17 17:55:31,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.673e+02 4.663e+02 7.889e+02 1.170e+03 3.370e+03, threshold=1.578e+03, percent-clipped=25.0
2023-06-17 17:56:39,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=10380.0, ans=0.05
2023-06-17 17:57:02,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10380.0, ans=0.19619999999999999
2023-06-17 17:57:13,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=10440.0, ans=0.5346000000000001
2023-06-17 17:57:14,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10440.0, ans=0.1956
2023-06-17 17:57:23,251 INFO [train.py:996] (0/4) Epoch 1, batch 1750, loss[loss=0.3173, simple_loss=0.3339, pruned_loss=0.1459, over 21328.00 frames. ], tot_loss[loss=0.5091, simple_loss=0.4866, pruned_loss=0.2676, over 4288718.50 frames. ], batch size: 176, lr: 4.44e-02, grad_scale: 8.0
2023-06-17 17:59:56,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=10740.0, ans=0.02191666666666667
2023-06-17 18:00:11,321 INFO [train.py:996] (0/4) Epoch 1, batch 1800, loss[loss=0.6106, simple_loss=0.5822, pruned_loss=0.3191, over 21417.00 frames. ], tot_loss[loss=0.4964, simple_loss=0.4781, pruned_loss=0.2583, over 4281799.75 frames. ], batch size: 507, lr: 4.44e-02, grad_scale: 8.0
2023-06-17 18:00:28,807 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 4.588e+02 7.695e+02 1.107e+03 4.356e+03, threshold=1.539e+03, percent-clipped=16.0
2023-06-17 18:01:10,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.50 vs. limit=7.715
2023-06-17 18:01:17,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=11.594999999999999
2023-06-17 18:01:54,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=10980.0, ans=0.125
2023-06-17 18:02:36,586 INFO [train.py:996] (0/4) Epoch 1, batch 1850, loss[loss=0.4637, simple_loss=0.432, pruned_loss=0.2483, over 20330.00 frames. ], tot_loss[loss=0.4917, simple_loss=0.4784, pruned_loss=0.2527, over 4279918.15 frames. ], batch size: 702, lr: 4.43e-02, grad_scale: 8.0
2023-06-17 18:02:48,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=11100.0, ans=0.04949747468305833
2023-06-17 18:04:08,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=11280.0, ans=0.01966666666666667
2023-06-17 18:04:37,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=11340.0, ans=0.008404347826086957
2023-06-17 18:04:54,553 INFO [train.py:996] (0/4) Epoch 1, batch 1900, loss[loss=0.4134, simple_loss=0.4122, pruned_loss=0.2063, over 21669.00 frames. ], tot_loss[loss=0.4868, simple_loss=0.4746, pruned_loss=0.2494, over 4287399.31 frames. ], batch size: 298, lr: 4.43e-02, grad_scale: 8.0
2023-06-17 18:05:12,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 4.389e+02 6.592e+02 1.010e+03 2.305e+03, threshold=1.318e+03, percent-clipped=4.0
2023-06-17 18:05:12,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=11400.0, ans=0.01916666666666667
2023-06-17 18:05:16,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=11460.0, ans=0.49890000000000007
2023-06-17 18:05:18,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=11460.0, ans=0.125
2023-06-17 18:06:14,713 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=11.82
2023-06-17 18:06:23,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=11580.0, ans=10.0
2023-06-17 18:07:05,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=11.8875
2023-06-17 18:07:06,307 INFO [train.py:996] (0/4) Epoch 1, batch 1950, loss[loss=0.4374, simple_loss=0.4162, pruned_loss=0.2293, over 21793.00 frames. ], tot_loss[loss=0.4793, simple_loss=0.467, pruned_loss=0.2455, over 4277688.08 frames. ], batch size: 372, lr: 4.43e-02, grad_scale: 8.0
2023-06-17 18:07:35,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=11760.0, ans=0.125
2023-06-17 18:09:07,378 INFO [train.py:996] (0/4) Epoch 1, batch 2000, loss[loss=0.4415, simple_loss=0.4632, pruned_loss=0.2099, over 21762.00 frames. ], tot_loss[loss=0.4668, simple_loss=0.4589, pruned_loss=0.2372, over 4275190.83 frames. ], batch size: 332, lr: 4.42e-02, grad_scale: 16.0
2023-06-17 18:09:37,815 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 4.885e+02 7.905e+02 1.281e+03 2.485e+03, threshold=1.581e+03, percent-clipped=23.0
2023-06-17 18:10:12,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=12120.0, ans=0.125
2023-06-17 18:10:30,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=12180.0, ans=0.035
2023-06-17 18:11:06,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=12240.0, ans=0.125
2023-06-17 18:11:07,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=8.896
2023-06-17 18:11:32,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=12240.0, ans=0.125
2023-06-17 18:11:39,760 INFO [train.py:996] (0/4) Epoch 1, batch 2050, loss[loss=0.4187, simple_loss=0.4311, pruned_loss=0.2032, over 21577.00 frames. ], tot_loss[loss=0.4677, simple_loss=0.4612, pruned_loss=0.2369, over 4276651.87 frames. ], batch size: 263, lr: 4.42e-02, grad_scale: 8.0
2023-06-17 18:11:49,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=12300.0, ans=0.125
2023-06-17 18:13:24,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=12.18
2023-06-17 18:13:36,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=12540.0, ans=0.4611
2023-06-17 18:13:46,858 INFO [train.py:996] (0/4) Epoch 1, batch 2100, loss[loss=0.3584, simple_loss=0.3609, pruned_loss=0.1779, over 21414.00 frames. ], tot_loss[loss=0.473, simple_loss=0.4665, pruned_loss=0.2396, over 4282190.95 frames. ], batch size: 212, lr: 4.42e-02, grad_scale: 8.0
2023-06-17 18:13:54,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=12600.0, ans=0.014166666666666668
2023-06-17 18:13:57,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=12600.0, ans=0.389
2023-06-17 18:13:59,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.516e+02 5.167e+02 7.622e+02 1.111e+03 2.066e+03, threshold=1.524e+03, percent-clipped=6.0
2023-06-17 18:14:38,772 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.13 vs. limit=17.04
2023-06-17 18:15:05,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=12720.0, ans=0.125
2023-06-17 18:15:56,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=12840.0, ans=0.125
2023-06-17 18:16:06,965 INFO [train.py:996] (0/4) Epoch 1, batch 2150, loss[loss=0.4232, simple_loss=0.4219, pruned_loss=0.2123, over 21348.00 frames. ], tot_loss[loss=0.4677, simple_loss=0.4622, pruned_loss=0.2365, over 4278524.12 frames. ], batch size: 131, lr: 4.41e-02, grad_scale: 8.0
2023-06-17 18:16:29,731 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=12.36
2023-06-17 18:16:36,835 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.93 vs. limit=4.944
2023-06-17 18:17:27,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=13020.0, ans=0.125
2023-06-17 18:17:30,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=13080.0, ans=0.008026086956521739
2023-06-17 18:17:38,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=13080.0, ans=0.012166666666666673
2023-06-17 18:17:38,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=13080.0, ans=0.0
2023-06-17 18:18:18,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=13140.0, ans=0.1
2023-06-17 18:18:29,030 INFO [train.py:996] (0/4) Epoch 1, batch 2200, loss[loss=0.4031, simple_loss=0.4382, pruned_loss=0.184, over 21762.00 frames. ], tot_loss[loss=0.4631, simple_loss=0.4617, pruned_loss=0.2322, over 4278015.13 frames. ], batch size: 298, lr: 4.41e-02, grad_scale: 8.0
2023-06-17 18:18:48,046 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 4.529e+02 5.924e+02 1.033e+03 2.265e+03, threshold=1.185e+03, percent-clipped=8.0
2023-06-17 18:18:48,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=13260.0, ans=0.125
2023-06-17 18:18:57,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=13260.0, ans=0.4359
2023-06-17 18:20:35,188 INFO [train.py:996] (0/4) Epoch 1, batch 2250, loss[loss=0.4632, simple_loss=0.4508, pruned_loss=0.2378, over 21332.00 frames. ], tot_loss[loss=0.4502, simple_loss=0.4539, pruned_loss=0.2232, over 4275569.38 frames. ], batch size: 471, lr: 4.40e-02, grad_scale: 8.0
2023-06-17 18:21:55,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=13680.0, ans=0.007895652173913043
2023-06-17 18:22:04,894 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=9.472000000000001
2023-06-17 18:22:06,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs.
limit=12.629999999999999 2023-06-17 18:22:30,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=13740.0, ans=0.007882608695652174 2023-06-17 18:22:31,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=13740.0, ans=0.035 2023-06-17 18:22:35,734 INFO [train.py:996] (0/4) Epoch 1, batch 2300, loss[loss=0.3895, simple_loss=0.3903, pruned_loss=0.1943, over 21183.00 frames. ], tot_loss[loss=0.4445, simple_loss=0.4481, pruned_loss=0.2204, over 4267058.97 frames. ], batch size: 176, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 18:22:53,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=13800.0, ans=0.00916666666666667 2023-06-17 18:22:54,880 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 4.349e+02 7.156e+02 9.563e+02 2.862e+03, threshold=1.431e+03, percent-clipped=11.0 2023-06-17 18:24:38,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=14040.0, ans=0.00816666666666667 2023-06-17 18:24:50,683 INFO [train.py:996] (0/4) Epoch 1, batch 2350, loss[loss=0.4741, simple_loss=0.5327, pruned_loss=0.2078, over 20713.00 frames. ], tot_loss[loss=0.4403, simple_loss=0.4433, pruned_loss=0.2187, over 4260087.29 frames. ], batch size: 607, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 18:25:50,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=14220.0, ans=0.125 2023-06-17 18:27:03,568 INFO [train.py:996] (0/4) Epoch 1, batch 2400, loss[loss=0.4847, simple_loss=0.474, pruned_loss=0.2478, over 21556.00 frames. ], tot_loss[loss=0.4483, simple_loss=0.4507, pruned_loss=0.2229, over 4263344.52 frames. ], batch size: 441, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:27:24,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.741e+02 4.803e+02 6.612e+02 1.169e+03 2.103e+03, threshold=1.322e+03, percent-clipped=15.0 2023-06-17 18:28:06,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=14520.0, ans=0.125 2023-06-17 18:28:12,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=14520.0, ans=0.125 2023-06-17 18:28:33,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=14580.0, ans=0.05 2023-06-17 18:28:44,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=18.48 2023-06-17 18:29:04,436 INFO [train.py:996] (0/4) Epoch 1, batch 2450, loss[loss=0.3544, simple_loss=0.3682, pruned_loss=0.1703, over 21398.00 frames. ], tot_loss[loss=0.4542, simple_loss=0.4554, pruned_loss=0.2265, over 4268230.77 frames. 
], batch size: 212, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:29:20,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=14760.0, ans=0.007660869565217391 2023-06-17 18:30:02,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=14880.0, ans=0.125 2023-06-17 18:30:47,066 INFO [train.py:996] (0/4) Epoch 1, batch 2500, loss[loss=0.4199, simple_loss=0.4566, pruned_loss=0.1916, over 21664.00 frames. ], tot_loss[loss=0.448, simple_loss=0.4513, pruned_loss=0.2224, over 4267041.04 frames. ], batch size: 332, lr: 4.38e-02, grad_scale: 16.0 2023-06-17 18:30:51,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=15000.0, ans=0.125 2023-06-17 18:31:09,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.980e+02 5.871e+02 7.826e+02 2.441e+03, threshold=1.174e+03, percent-clipped=5.0 2023-06-17 18:32:56,414 INFO [train.py:996] (0/4) Epoch 1, batch 2550, loss[loss=0.3895, simple_loss=0.4135, pruned_loss=0.1828, over 21843.00 frames. ], tot_loss[loss=0.4438, simple_loss=0.4492, pruned_loss=0.2192, over 4258399.21 frames. ], batch size: 98, lr: 4.38e-02, grad_scale: 16.0 2023-06-17 18:33:19,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=15360.0, ans=0.4304 2023-06-17 18:33:28,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15420.0, ans=0.1458 2023-06-17 18:33:53,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15420.0, ans=0.1458 2023-06-17 18:34:46,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15540.0, ans=0.1446 2023-06-17 18:35:02,572 INFO [train.py:996] (0/4) Epoch 1, batch 2600, loss[loss=0.4884, simple_loss=0.4856, pruned_loss=0.2456, over 21598.00 frames. ], tot_loss[loss=0.4471, simple_loss=0.4506, pruned_loss=0.2218, over 4268536.66 frames. ], batch size: 415, lr: 4.37e-02, grad_scale: 16.0 2023-06-17 18:35:16,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 4.497e+02 6.428e+02 1.038e+03 2.322e+03, threshold=1.286e+03, percent-clipped=17.0 2023-06-17 18:35:18,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15660.0, ans=0.1434 2023-06-17 18:36:10,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=5.3580000000000005 2023-06-17 18:36:51,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=15840.0, ans=0.3456 2023-06-17 18:36:56,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.22 vs. limit=12.92 2023-06-17 18:37:03,977 INFO [train.py:996] (0/4) Epoch 1, batch 2650, loss[loss=0.3596, simple_loss=0.4019, pruned_loss=0.1587, over 17423.00 frames. ], tot_loss[loss=0.4459, simple_loss=0.4504, pruned_loss=0.2207, over 4268944.09 frames. 
], batch size: 60, lr: 4.37e-02, grad_scale: 8.0 2023-06-17 18:37:38,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=15960.0, ans=0.125 2023-06-17 18:37:42,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=16020.0, ans=0.007386956521739131 2023-06-17 18:38:40,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=16080.0, ans=0.125 2023-06-17 18:39:05,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=16140.0, ans=0.33510000000000006 2023-06-17 18:39:09,547 INFO [train.py:996] (0/4) Epoch 1, batch 2700, loss[loss=0.4875, simple_loss=0.5126, pruned_loss=0.2313, over 21355.00 frames. ], tot_loss[loss=0.4369, simple_loss=0.4446, pruned_loss=0.2146, over 4261677.80 frames. ], batch size: 548, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:39:14,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.57 vs. limit=5.43 2023-06-17 18:39:27,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=16260.0, ans=0.04949747468305833 2023-06-17 18:39:28,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 4.133e+02 5.896e+02 7.988e+02 2.040e+03, threshold=1.179e+03, percent-clipped=10.0 2023-06-17 18:41:13,169 INFO [train.py:996] (0/4) Epoch 1, batch 2750, loss[loss=0.4389, simple_loss=0.4781, pruned_loss=0.1999, over 21821.00 frames. ], tot_loss[loss=0.4321, simple_loss=0.4416, pruned_loss=0.2113, over 4262594.23 frames. ], batch size: 298, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:42:28,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=10.648 2023-06-17 18:43:47,053 INFO [train.py:996] (0/4) Epoch 1, batch 2800, loss[loss=0.4463, simple_loss=0.4658, pruned_loss=0.2134, over 21766.00 frames. ], tot_loss[loss=0.4342, simple_loss=0.4447, pruned_loss=0.2118, over 4270293.23 frames. ], batch size: 332, lr: 4.36e-02, grad_scale: 16.0 2023-06-17 18:43:47,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=16800.0, ans=0.31200000000000006 2023-06-17 18:43:50,969 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.36 vs. limit=5.52 2023-06-17 18:43:52,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. 
limit=13.8 2023-06-17 18:44:18,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.642e+02 4.703e+02 6.814e+02 1.223e+03 2.130e+03, threshold=1.363e+03, percent-clipped=25.0 2023-06-17 18:44:20,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=16860.0, ans=0.007204347826086957 2023-06-17 18:44:23,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=16860.0, ans=0.125 2023-06-17 18:46:04,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=17100.0, ans=0.125 2023-06-17 18:46:05,393 INFO [train.py:996] (0/4) Epoch 1, batch 2850, loss[loss=0.27, simple_loss=0.2939, pruned_loss=0.123, over 21136.00 frames. ], tot_loss[loss=0.4326, simple_loss=0.4434, pruned_loss=0.2109, over 4270717.92 frames. ], batch size: 143, lr: 4.35e-02, grad_scale: 16.0 2023-06-17 18:46:23,515 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:46:45,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=17160.0, ans=0.125 2023-06-17 18:47:27,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=17280.0, ans=0.125 2023-06-17 18:48:20,775 INFO [train.py:996] (0/4) Epoch 1, batch 2900, loss[loss=0.467, simple_loss=0.4608, pruned_loss=0.2366, over 21898.00 frames. ], tot_loss[loss=0.4264, simple_loss=0.438, pruned_loss=0.2074, over 4267611.50 frames. ], batch size: 371, lr: 4.35e-02, grad_scale: 16.0 2023-06-17 18:48:25,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=17400.0, ans=0.007086956521739131 2023-06-17 18:48:26,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=17400.0, ans=0.125 2023-06-17 18:48:48,695 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 4.392e+02 5.988e+02 8.416e+02 1.775e+03, threshold=1.198e+03, percent-clipped=6.0 2023-06-17 18:49:45,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=17580.0, ans=0.125 2023-06-17 18:49:57,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=17580.0, ans=0.125 2023-06-17 18:50:57,611 INFO [train.py:996] (0/4) Epoch 1, batch 2950, loss[loss=0.3603, simple_loss=0.3764, pruned_loss=0.1721, over 21630.00 frames. ], tot_loss[loss=0.4281, simple_loss=0.4406, pruned_loss=0.2078, over 4272449.67 frames. ], batch size: 263, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:51:02,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=17700.0, ans=0.125 2023-06-17 18:51:17,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.22 vs. 
limit=9.440000000000001 2023-06-17 18:51:51,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=17820.0, ans=14.182500000000001 2023-06-17 18:51:54,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.99 vs. limit=9.455 2023-06-17 18:52:23,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=17880.0, ans=0.125 2023-06-17 18:53:14,233 INFO [train.py:996] (0/4) Epoch 1, batch 3000, loss[loss=0.5541, simple_loss=0.5291, pruned_loss=0.2895, over 21376.00 frames. ], tot_loss[loss=0.4299, simple_loss=0.4442, pruned_loss=0.2078, over 4277100.82 frames. ], batch size: 508, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:53:14,234 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-17 18:54:04,508 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.4878, 4.3712, 3.9784, 3.7966], device='cuda:0') 2023-06-17 18:54:05,128 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3426, simple_loss=0.4236, pruned_loss=0.1308, over 1796401.00 frames. 2023-06-17 18:54:05,130 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23584MB 2023-06-17 18:54:33,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.648e+02 4.493e+02 5.938e+02 7.860e+02 2.320e+03, threshold=1.188e+03, percent-clipped=8.0 2023-06-17 18:54:36,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=18060.0, ans=0.125 2023-06-17 18:54:38,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=18060.0, ans=0.125 2023-06-17 18:54:43,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.36 vs. limit=14.2725 2023-06-17 18:56:00,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=14.34 2023-06-17 18:56:05,088 INFO [train.py:996] (0/4) Epoch 1, batch 3050, loss[loss=0.2266, simple_loss=0.2625, pruned_loss=0.09537, over 16457.00 frames. ], tot_loss[loss=0.4279, simple_loss=0.4449, pruned_loss=0.2055, over 4270901.23 frames. ], batch size: 60, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:56:17,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=18300.0, ans=0.0 2023-06-17 18:56:18,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=18300.0, ans=0.125 2023-06-17 18:56:33,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=18360.0, ans=0.006878260869565217 2023-06-17 18:56:37,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=18360.0, ans=0.2574000000000001 2023-06-17 18:57:17,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.48 vs. 
limit=14.407499999999999 2023-06-17 18:58:29,011 INFO [train.py:996] (0/4) Epoch 1, batch 3100, loss[loss=0.3779, simple_loss=0.4232, pruned_loss=0.1663, over 21831.00 frames. ], tot_loss[loss=0.4217, simple_loss=0.4411, pruned_loss=0.2011, over 4279274.29 frames. ], batch size: 282, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:58:48,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=18660.0, ans=0.2469 2023-06-17 18:58:50,767 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.527e+02 3.860e+02 4.806e+02 7.218e+02 1.901e+03, threshold=9.611e+02, percent-clipped=6.0 2023-06-17 18:59:13,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=18720.0, ans=0.0068 2023-06-17 18:59:58,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=18780.0, ans=0.125 2023-06-17 19:00:13,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=18840.0, ans=0.125 2023-06-17 19:00:31,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=18840.0, ans=0.0 2023-06-17 19:00:55,769 INFO [train.py:996] (0/4) Epoch 1, batch 3150, loss[loss=0.3965, simple_loss=0.3688, pruned_loss=0.2121, over 19990.00 frames. ], tot_loss[loss=0.4257, simple_loss=0.4436, pruned_loss=0.2039, over 4275603.91 frames. ], batch size: 704, lr: 4.32e-02, grad_scale: 8.0 2023-06-17 19:01:46,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=19020.0, ans=0.125 2023-06-17 19:02:11,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=19020.0, ans=0.04949747468305833 2023-06-17 19:02:13,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=14.655000000000001 2023-06-17 19:03:25,408 INFO [train.py:996] (0/4) Epoch 1, batch 3200, loss[loss=0.4216, simple_loss=0.4616, pruned_loss=0.1908, over 21621.00 frames. ], tot_loss[loss=0.4249, simple_loss=0.4439, pruned_loss=0.2029, over 4273251.69 frames. ], batch size: 263, lr: 4.32e-02, grad_scale: 16.0 2023-06-17 19:03:27,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=25.59 vs. limit=21.9 2023-06-17 19:04:02,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 3.900e+02 5.632e+02 8.444e+02 2.494e+03, threshold=1.126e+03, percent-clipped=20.0 2023-06-17 19:04:07,744 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.03 vs. limit=14.7225 2023-06-17 19:05:03,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=14.7675 2023-06-17 19:05:52,665 INFO [train.py:996] (0/4) Epoch 1, batch 3250, loss[loss=0.4278, simple_loss=0.4219, pruned_loss=0.2168, over 19905.00 frames. ], tot_loss[loss=0.4271, simple_loss=0.4445, pruned_loss=0.2049, over 4271301.82 frames. 
], batch size: 702, lr: 4.31e-02, grad_scale: 16.0 2023-06-17 19:06:09,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=19560.0, ans=0.125 2023-06-17 19:07:16,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=19680.0, ans=0.006591304347826087 2023-06-17 19:07:38,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=19740.0, ans=0.125 2023-06-17 19:07:59,373 INFO [train.py:996] (0/4) Epoch 1, batch 3300, loss[loss=0.4484, simple_loss=0.4623, pruned_loss=0.2173, over 21270.00 frames. ], tot_loss[loss=0.4226, simple_loss=0.4392, pruned_loss=0.203, over 4270677.53 frames. ], batch size: 548, lr: 4.31e-02, grad_scale: 16.0 2023-06-17 19:08:07,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19800.0, ans=0.10200000000000001 2023-06-17 19:08:18,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=19860.0, ans=0.006552173913043479 2023-06-17 19:08:20,955 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 3.802e+02 5.443e+02 8.160e+02 1.939e+03, threshold=1.089e+03, percent-clipped=11.0 2023-06-17 19:08:36,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19860.0, ans=0.10140000000000002 2023-06-17 19:09:57,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=20040.0, ans=0.2 2023-06-17 19:10:04,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=20100.0, ans=0.125 2023-06-17 19:10:20,567 INFO [train.py:996] (0/4) Epoch 1, batch 3350, loss[loss=0.4347, simple_loss=0.444, pruned_loss=0.2127, over 21485.00 frames. ], tot_loss[loss=0.4207, simple_loss=0.4388, pruned_loss=0.2013, over 4268182.82 frames. ], batch size: 194, lr: 4.30e-02, grad_scale: 8.0 2023-06-17 19:10:25,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=20100.0, ans=0.07 2023-06-17 19:11:03,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20160.0, ans=0.1 2023-06-17 19:11:28,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=20220.0, ans=0.2 2023-06-17 19:12:27,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-17 19:12:43,815 INFO [train.py:996] (0/4) Epoch 1, batch 3400, loss[loss=0.4256, simple_loss=0.4216, pruned_loss=0.2148, over 20137.00 frames. ], tot_loss[loss=0.4183, simple_loss=0.4368, pruned_loss=0.1999, over 4275650.71 frames. ], batch size: 702, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 19:13:07,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 4.511e+02 6.532e+02 8.905e+02 1.651e+03, threshold=1.306e+03, percent-clipped=8.0 2023-06-17 19:13:16,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.96 vs. 
limit=15.0 2023-06-17 19:13:41,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=20520.0, ans=0.0 2023-06-17 19:13:56,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20580.0, ans=0.1 2023-06-17 19:14:58,844 INFO [train.py:996] (0/4) Epoch 1, batch 3450, loss[loss=0.38, simple_loss=0.4001, pruned_loss=0.18, over 21807.00 frames. ], tot_loss[loss=0.4114, simple_loss=0.4296, pruned_loss=0.1965, over 4279848.66 frames. ], batch size: 107, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 19:15:23,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-17 19:15:37,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=20760.0, ans=0.125 2023-06-17 19:15:57,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=20820.0, ans=0.2 2023-06-17 19:17:13,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.76 vs. limit=22.5 2023-06-17 19:17:20,354 INFO [train.py:996] (0/4) Epoch 1, batch 3500, loss[loss=0.5163, simple_loss=0.506, pruned_loss=0.2633, over 21930.00 frames. ], tot_loss[loss=0.4225, simple_loss=0.4406, pruned_loss=0.2022, over 4266957.01 frames. ], batch size: 372, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 19:17:58,124 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.542e+02 4.045e+02 5.374e+02 7.279e+02 2.253e+03, threshold=1.075e+03, percent-clipped=5.0 2023-06-17 19:19:42,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=21240.0, ans=0.0 2023-06-17 19:19:45,180 INFO [train.py:996] (0/4) Epoch 1, batch 3550, loss[loss=0.3826, simple_loss=0.3804, pruned_loss=0.1924, over 20190.00 frames. ], tot_loss[loss=0.4242, simple_loss=0.4426, pruned_loss=0.2029, over 4258790.37 frames. ], batch size: 703, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 19:20:24,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=21360.0, ans=0.0 2023-06-17 19:20:30,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=21360.0, ans=0.006226086956521739 2023-06-17 19:20:31,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=21360.0, ans=0.2 2023-06-17 19:20:33,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=21360.0, ans=0.025 2023-06-17 19:20:47,317 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.32 vs. 
limit=12.0 2023-06-17 19:21:28,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=21480.0, ans=0.125 2023-06-17 19:21:49,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=21540.0, ans=0.125 2023-06-17 19:21:53,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=21540.0, ans=0.125 2023-06-17 19:21:59,193 INFO [train.py:996] (0/4) Epoch 1, batch 3600, loss[loss=0.3423, simple_loss=0.3954, pruned_loss=0.1446, over 16452.00 frames. ], tot_loss[loss=0.4188, simple_loss=0.4365, pruned_loss=0.2005, over 4256710.02 frames. ], batch size: 63, lr: 4.27e-02, grad_scale: 16.0 2023-06-17 19:22:38,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=21660.0, ans=0.125 2023-06-17 19:22:43,392 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 3.884e+02 5.320e+02 7.505e+02 1.580e+03, threshold=1.064e+03, percent-clipped=11.0 2023-06-17 19:23:53,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-17 19:24:08,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=21840.0, ans=0.125 2023-06-17 19:24:43,922 INFO [train.py:996] (0/4) Epoch 1, batch 3650, loss[loss=0.5287, simple_loss=0.5435, pruned_loss=0.257, over 21559.00 frames. ], tot_loss[loss=0.4206, simple_loss=0.4391, pruned_loss=0.201, over 4264559.45 frames. ], batch size: 263, lr: 4.27e-02, grad_scale: 16.0 2023-06-17 19:24:45,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.95 vs. limit=22.5 2023-06-17 19:25:02,680 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.48 vs. limit=22.5 2023-06-17 19:25:05,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=21960.0, ans=0.125 2023-06-17 19:25:53,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=22080.0, ans=0.125 2023-06-17 19:26:07,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=22080.0, ans=0.125 2023-06-17 19:26:14,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=22080.0, ans=0.2 2023-06-17 19:26:24,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.44 vs. limit=15.0 2023-06-17 19:26:52,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=22140.0, ans=0.0 2023-06-17 19:26:58,876 INFO [train.py:996] (0/4) Epoch 1, batch 3700, loss[loss=0.4143, simple_loss=0.4382, pruned_loss=0.1952, over 21902.00 frames. ], tot_loss[loss=0.4174, simple_loss=0.4373, pruned_loss=0.1987, over 4270526.92 frames. 
], batch size: 351, lr: 4.26e-02, grad_scale: 16.0 2023-06-17 19:27:30,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 4.188e+02 5.889e+02 8.889e+02 2.124e+03, threshold=1.178e+03, percent-clipped=16.0 2023-06-17 19:27:36,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=22260.0, ans=0.0 2023-06-17 19:27:38,058 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:27:41,338 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-17 19:29:19,301 INFO [train.py:996] (0/4) Epoch 1, batch 3750, loss[loss=0.4161, simple_loss=0.4851, pruned_loss=0.1735, over 20904.00 frames. ], tot_loss[loss=0.4118, simple_loss=0.4324, pruned_loss=0.1956, over 4273662.95 frames. ], batch size: 608, lr: 4.26e-02, grad_scale: 16.0 2023-06-17 19:29:28,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=22500.0, ans=0.125 2023-06-17 19:29:54,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=22560.0, ans=0.2 2023-06-17 19:29:57,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=22560.0, ans=0.125 2023-06-17 19:30:31,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=22620.0, ans=0.125 2023-06-17 19:31:54,519 INFO [train.py:996] (0/4) Epoch 1, batch 3800, loss[loss=0.4564, simple_loss=0.4674, pruned_loss=0.2227, over 21309.00 frames. ], tot_loss[loss=0.4096, simple_loss=0.4312, pruned_loss=0.194, over 4275083.98 frames. ], batch size: 143, lr: 4.25e-02, grad_scale: 16.0 2023-06-17 19:31:58,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=22800.0, ans=0.125 2023-06-17 19:32:08,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=22860.0, ans=0.0 2023-06-17 19:32:09,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=22860.0, ans=0.125 2023-06-17 19:32:11,770 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.353e+02 4.457e+02 7.554e+02 3.391e+03, threshold=8.914e+02, percent-clipped=13.0 2023-06-17 19:33:51,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=23040.0, ans=0.125 2023-06-17 19:33:58,053 INFO [train.py:996] (0/4) Epoch 1, batch 3850, loss[loss=0.3648, simple_loss=0.3745, pruned_loss=0.1776, over 21584.00 frames. ], tot_loss[loss=0.4057, simple_loss=0.4263, pruned_loss=0.1925, over 4283068.81 frames. ], batch size: 298, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 19:35:17,776 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. 
limit=15.0 2023-06-17 19:35:40,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=23280.0, ans=0.125 2023-06-17 19:36:10,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=23340.0, ans=0.2 2023-06-17 19:36:14,346 INFO [train.py:996] (0/4) Epoch 1, batch 3900, loss[loss=0.4325, simple_loss=0.4375, pruned_loss=0.2137, over 21862.00 frames. ], tot_loss[loss=0.4002, simple_loss=0.4203, pruned_loss=0.19, over 4279066.32 frames. ], batch size: 414, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 19:36:29,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=23400.0, ans=0.0 2023-06-17 19:36:30,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=23400.0, ans=0.125 2023-06-17 19:36:37,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-17 19:36:38,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.460e+02 4.927e+02 7.077e+02 1.688e+03, threshold=9.853e+02, percent-clipped=16.0 2023-06-17 19:37:26,227 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.35 vs. limit=22.5 2023-06-17 19:37:47,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=23580.0, ans=0.005743478260869565 2023-06-17 19:37:59,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=23580.0, ans=0.125 2023-06-17 19:38:38,971 INFO [train.py:996] (0/4) Epoch 1, batch 3950, loss[loss=0.3611, simple_loss=0.4048, pruned_loss=0.1587, over 21807.00 frames. ], tot_loss[loss=0.398, simple_loss=0.4198, pruned_loss=0.1881, over 4274446.44 frames. ], batch size: 371, lr: 4.23e-02, grad_scale: 8.0 2023-06-17 19:38:55,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=23760.0, ans=0.005704347826086957 2023-06-17 19:39:11,315 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:39:14,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=23760.0, ans=0.125 2023-06-17 19:40:01,479 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.62 vs. limit=10.0 2023-06-17 19:40:36,007 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=12.0 2023-06-17 19:40:40,519 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-17 19:40:56,692 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-4000.pt 2023-06-17 19:41:01,648 INFO [train.py:996] (0/4) Epoch 1, batch 4000, loss[loss=0.3885, simple_loss=0.3882, pruned_loss=0.1944, over 21452.00 frames. ], tot_loss[loss=0.39, simple_loss=0.4136, pruned_loss=0.1832, over 4274794.01 frames. 
], batch size: 441, lr: 4.23e-02, grad_scale: 16.0 2023-06-17 19:41:15,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=24060.0, ans=0.0 2023-06-17 19:41:35,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 3.734e+02 4.906e+02 6.607e+02 1.436e+03, threshold=9.812e+02, percent-clipped=4.0 2023-06-17 19:41:51,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=24060.0, ans=0.1 2023-06-17 19:42:04,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24120.0, ans=0.1 2023-06-17 19:43:15,717 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:43:26,361 INFO [train.py:996] (0/4) Epoch 1, batch 4050, loss[loss=0.3028, simple_loss=0.3671, pruned_loss=0.1192, over 21690.00 frames. ], tot_loss[loss=0.3867, simple_loss=0.4135, pruned_loss=0.1799, over 4278919.15 frames. ], batch size: 247, lr: 4.22e-02, grad_scale: 4.0 2023-06-17 19:43:30,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24300.0, ans=0.1 2023-06-17 19:43:32,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24300.0, ans=0.1 2023-06-17 19:44:35,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=24420.0, ans=0.125 2023-06-17 19:45:25,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=24540.0, ans=0.125 2023-06-17 19:45:32,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.90 vs. limit=22.5 2023-06-17 19:45:38,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=24540.0, ans=0.2 2023-06-17 19:45:42,499 INFO [train.py:996] (0/4) Epoch 1, batch 4100, loss[loss=0.3675, simple_loss=0.3944, pruned_loss=0.1703, over 21901.00 frames. ], tot_loss[loss=0.3897, simple_loss=0.4148, pruned_loss=0.1823, over 4284162.08 frames. ], batch size: 332, lr: 4.22e-02, grad_scale: 8.0 2023-06-17 19:46:03,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=24600.0, ans=0.005521739130434783 2023-06-17 19:46:34,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.802e+02 5.077e+02 7.572e+02 1.841e+03, threshold=1.015e+03, percent-clipped=11.0 2023-06-17 19:47:30,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=24780.0, ans=0.5 2023-06-17 19:47:50,396 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.81 vs. 
limit=15.0 2023-06-17 19:47:51,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=24840.0, ans=0.0 2023-06-17 19:47:53,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=24840.0, ans=0.125 2023-06-17 19:48:02,816 INFO [train.py:996] (0/4) Epoch 1, batch 4150, loss[loss=0.3103, simple_loss=0.3571, pruned_loss=0.1318, over 21337.00 frames. ], tot_loss[loss=0.3856, simple_loss=0.4154, pruned_loss=0.1779, over 4274323.66 frames. ], batch size: 131, lr: 4.21e-02, grad_scale: 8.0 2023-06-17 19:49:01,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=25020.0, ans=0.005430434782608695 2023-06-17 19:49:31,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=25080.0, ans=0.0 2023-06-17 19:49:33,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=25080.0, ans=0.125 2023-06-17 19:49:34,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=25080.0, ans=0.0 2023-06-17 19:50:18,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=25140.0, ans=0.125 2023-06-17 19:50:32,236 INFO [train.py:996] (0/4) Epoch 1, batch 4200, loss[loss=0.476, simple_loss=0.4747, pruned_loss=0.2386, over 21385.00 frames. ], tot_loss[loss=0.3862, simple_loss=0.4161, pruned_loss=0.1782, over 4271697.67 frames. ], batch size: 548, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:51:05,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=25200.0, ans=0.2 2023-06-17 19:51:08,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=25200.0, ans=0.2 2023-06-17 19:51:18,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 3.271e+02 4.382e+02 6.312e+02 1.234e+03, threshold=8.764e+02, percent-clipped=8.0 2023-06-17 19:51:24,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-17 19:51:42,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=25320.0, ans=0.125 2023-06-17 19:52:30,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=25440.0, ans=0.2 2023-06-17 19:52:33,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=25440.0, ans=0.0 2023-06-17 19:52:33,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=25440.0, ans=0.2 2023-06-17 19:53:04,745 INFO [train.py:996] (0/4) Epoch 1, batch 4250, loss[loss=0.5491, simple_loss=0.5416, pruned_loss=0.2783, over 21342.00 frames. ], tot_loss[loss=0.3939, simple_loss=0.4233, pruned_loss=0.1822, over 4267681.76 frames. ], batch size: 507, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:53:06,923 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=26.55 vs. 
limit=15.0 2023-06-17 19:53:30,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25500.0, ans=0.1 2023-06-17 19:53:30,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.50 vs. limit=22.5 2023-06-17 19:53:31,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=25560.0, ans=0.125 2023-06-17 19:53:36,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25560.0, ans=0.1 2023-06-17 19:54:32,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=25680.0, ans=0.00528695652173913 2023-06-17 19:55:33,833 INFO [train.py:996] (0/4) Epoch 1, batch 4300, loss[loss=0.4399, simple_loss=0.4857, pruned_loss=0.197, over 21527.00 frames. ], tot_loss[loss=0.4018, simple_loss=0.4317, pruned_loss=0.186, over 4261736.47 frames. ], batch size: 471, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:55:34,947 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2023-06-17 19:55:40,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=25800.0, ans=0.125 2023-06-17 19:55:51,282 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.10 vs. limit=22.5 2023-06-17 19:56:30,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 4.361e+02 6.749e+02 9.023e+02 1.594e+03, threshold=1.350e+03, percent-clipped=28.0 2023-06-17 19:56:34,061 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:56:35,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=25860.0, ans=0.005247826086956522 2023-06-17 19:56:38,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=25920.0, ans=0.125 2023-06-17 19:56:50,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25920.0, ans=0.1 2023-06-17 19:57:54,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.41 vs. limit=15.0 2023-06-17 19:57:55,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=26040.0, ans=0.005208695652173913 2023-06-17 19:58:04,460 INFO [train.py:996] (0/4) Epoch 1, batch 4350, loss[loss=0.4096, simple_loss=0.4164, pruned_loss=0.2014, over 21458.00 frames. ], tot_loss[loss=0.3973, simple_loss=0.4274, pruned_loss=0.1836, over 4254615.20 frames. 
], batch size: 389, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:58:25,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=26160.0, ans=0.125 2023-06-17 19:58:36,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=26160.0, ans=0.005182608695652174 2023-06-17 19:58:47,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=26220.0, ans=0.125 2023-06-17 20:00:01,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=26340.0, ans=0.0 2023-06-17 20:00:06,830 INFO [train.py:996] (0/4) Epoch 1, batch 4400, loss[loss=0.4612, simple_loss=0.4617, pruned_loss=0.2303, over 19959.00 frames. ], tot_loss[loss=0.3925, simple_loss=0.4214, pruned_loss=0.1818, over 4251241.09 frames. ], batch size: 702, lr: 4.18e-02, grad_scale: 16.0 2023-06-17 20:00:15,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=26400.0, ans=0.0 2023-06-17 20:00:53,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 3.739e+02 5.540e+02 6.939e+02 1.405e+03, threshold=1.108e+03, percent-clipped=1.0 2023-06-17 20:01:32,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=26520.0, ans=0.125 2023-06-17 20:01:38,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=26580.0, ans=0.1 2023-06-17 20:02:38,865 INFO [train.py:996] (0/4) Epoch 1, batch 4450, loss[loss=0.4689, simple_loss=0.4945, pruned_loss=0.2216, over 21775.00 frames. ], tot_loss[loss=0.3954, simple_loss=0.4274, pruned_loss=0.1817, over 4255154.49 frames. ], batch size: 414, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 20:02:40,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=26700.0, ans=0.125 2023-06-17 20:03:10,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=26760.0, ans=0.125 2023-06-17 20:03:46,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=26820.0, ans=0.125 2023-06-17 20:04:00,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-17 20:04:47,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=27000.0, ans=0.07 2023-06-17 20:04:47,886 INFO [train.py:996] (0/4) Epoch 1, batch 4500, loss[loss=0.463, simple_loss=0.4846, pruned_loss=0.2207, over 21630.00 frames. ], tot_loss[loss=0.4006, simple_loss=0.43, pruned_loss=0.1856, over 4261652.54 frames. 
], batch size: 471, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 20:04:51,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27000.0, ans=0.1 2023-06-17 20:05:22,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=27060.0, ans=0.2 2023-06-17 20:05:39,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=27060.0, ans=0.02 2023-06-17 20:05:40,586 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.441e+02 4.481e+02 5.907e+02 7.861e+02 1.389e+03, threshold=1.181e+03, percent-clipped=9.0 2023-06-17 20:06:08,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=27180.0, ans=0.125 2023-06-17 20:06:09,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=27180.0, ans=0.125 2023-06-17 20:07:14,396 INFO [train.py:996] (0/4) Epoch 1, batch 4550, loss[loss=0.482, simple_loss=0.49, pruned_loss=0.237, over 21603.00 frames. ], tot_loss[loss=0.4036, simple_loss=0.4339, pruned_loss=0.1866, over 4264610.22 frames. ], batch size: 389, lr: 4.16e-02, grad_scale: 4.0 2023-06-17 20:07:56,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=27360.0, ans=0.0 2023-06-17 20:09:37,114 INFO [train.py:996] (0/4) Epoch 1, batch 4600, loss[loss=0.3411, simple_loss=0.3799, pruned_loss=0.1511, over 21394.00 frames. ], tot_loss[loss=0.4021, simple_loss=0.4335, pruned_loss=0.1854, over 4267804.05 frames. ], batch size: 211, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 20:10:27,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.731e+02 3.841e+02 4.647e+02 5.664e+02 1.586e+03, threshold=9.294e+02, percent-clipped=2.0 2023-06-17 20:10:29,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=27660.0, ans=0.0 2023-06-17 20:12:02,318 INFO [train.py:996] (0/4) Epoch 1, batch 4650, loss[loss=0.3032, simple_loss=0.3404, pruned_loss=0.133, over 21766.00 frames. ], tot_loss[loss=0.3915, simple_loss=0.4228, pruned_loss=0.1801, over 4276019.98 frames. ], batch size: 247, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 20:13:28,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-17 20:13:49,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.26 vs. limit=6.0 2023-06-17 20:13:54,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=28080.0, ans=0.0 2023-06-17 20:14:03,087 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=15.0 2023-06-17 20:14:13,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=28200.0, ans=0.2 2023-06-17 20:14:13,977 INFO [train.py:996] (0/4) Epoch 1, batch 4700, loss[loss=0.3268, simple_loss=0.3578, pruned_loss=0.148, over 21741.00 frames. ], tot_loss[loss=0.3824, simple_loss=0.4119, pruned_loss=0.1764, over 4274033.40 frames. 
], batch size: 112, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 20:15:16,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 4.151e+02 4.936e+02 6.766e+02 1.742e+03, threshold=9.871e+02, percent-clipped=9.0 2023-06-17 20:16:30,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28500.0, ans=0.1 2023-06-17 20:16:31,465 INFO [train.py:996] (0/4) Epoch 1, batch 4750, loss[loss=0.4202, simple_loss=0.4533, pruned_loss=0.1935, over 20767.00 frames. ], tot_loss[loss=0.3791, simple_loss=0.4059, pruned_loss=0.1762, over 4277213.28 frames. ], batch size: 608, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 20:17:58,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=28680.0, ans=0.125 2023-06-17 20:18:36,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-17 20:18:41,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-17 20:19:08,691 INFO [train.py:996] (0/4) Epoch 1, batch 4800, loss[loss=0.3736, simple_loss=0.3983, pruned_loss=0.1745, over 21374.00 frames. ], tot_loss[loss=0.3802, simple_loss=0.4066, pruned_loss=0.1768, over 4280047.95 frames. ], batch size: 131, lr: 4.13e-02, grad_scale: 16.0 2023-06-17 20:19:13,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=28800.0, ans=0.05 2023-06-17 20:19:21,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=28800.0, ans=0.2 2023-06-17 20:19:31,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28860.0, ans=0.1 2023-06-17 20:19:33,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 4.275e+02 5.086e+02 6.755e+02 1.816e+03, threshold=1.017e+03, percent-clipped=8.0 2023-06-17 20:20:09,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=28920.0, ans=15.0 2023-06-17 20:20:24,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=28980.0, ans=0.125 2023-06-17 20:20:36,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=29040.0, ans=0.125 2023-06-17 20:21:04,166 INFO [train.py:996] (0/4) Epoch 1, batch 4850, loss[loss=0.4361, simple_loss=0.4867, pruned_loss=0.1927, over 19962.00 frames. ], tot_loss[loss=0.3805, simple_loss=0.4072, pruned_loss=0.1769, over 4279125.38 frames. ], batch size: 703, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 20:21:29,377 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-06-17 20:21:43,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=29160.0, ans=0.1 2023-06-17 20:21:43,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.78 vs. 
limit=15.0 2023-06-17 20:22:44,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=29280.0, ans=0.2 2023-06-17 20:23:32,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=29340.0, ans=0.125 2023-06-17 20:23:36,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=29400.0, ans=0.004478260869565218 2023-06-17 20:23:37,873 INFO [train.py:996] (0/4) Epoch 1, batch 4900, loss[loss=0.423, simple_loss=0.43, pruned_loss=0.208, over 21602.00 frames. ], tot_loss[loss=0.3831, simple_loss=0.4096, pruned_loss=0.1783, over 4282637.68 frames. ], batch size: 548, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 20:23:47,578 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.37 vs. limit=22.5 2023-06-17 20:24:01,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=29460.0, ans=0.004465217391304348 2023-06-17 20:24:08,470 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.531e+02 3.410e+02 4.347e+02 5.276e+02 1.356e+03, threshold=8.693e+02, percent-clipped=2.0 2023-06-17 20:25:57,160 INFO [train.py:996] (0/4) Epoch 1, batch 4950, loss[loss=0.3395, simple_loss=0.4104, pruned_loss=0.1343, over 21833.00 frames. ], tot_loss[loss=0.3814, simple_loss=0.4125, pruned_loss=0.1751, over 4284337.39 frames. ], batch size: 317, lr: 4.11e-02, grad_scale: 16.0 2023-06-17 20:26:03,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=29700.0, ans=0.0 2023-06-17 20:26:04,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=29700.0, ans=0.0044130434782608694 2023-06-17 20:26:59,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=15.0 2023-06-17 20:27:07,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=29820.0, ans=0.0 2023-06-17 20:27:23,625 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2023-06-17 20:28:07,664 INFO [train.py:996] (0/4) Epoch 1, batch 5000, loss[loss=0.4275, simple_loss=0.4266, pruned_loss=0.2143, over 21492.00 frames. ], tot_loss[loss=0.3751, simple_loss=0.4113, pruned_loss=0.1695, over 4279795.53 frames. ], batch size: 548, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 20:28:21,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. 
limit=15.0 2023-06-17 20:28:42,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 3.507e+02 4.413e+02 5.456e+02 1.135e+03, threshold=8.826e+02, percent-clipped=2.0 2023-06-17 20:29:17,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=30120.0, ans=0.025 2023-06-17 20:29:29,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=30180.0, ans=0.125 2023-06-17 20:29:32,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=30180.0, ans=0.125 2023-06-17 20:29:33,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=30180.0, ans=0.0 2023-06-17 20:29:33,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=30180.0, ans=0.125 2023-06-17 20:30:10,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=30300.0, ans=0.004282608695652174 2023-06-17 20:30:11,510 INFO [train.py:996] (0/4) Epoch 1, batch 5050, loss[loss=0.3269, simple_loss=0.3843, pruned_loss=0.1347, over 21638.00 frames. ], tot_loss[loss=0.3788, simple_loss=0.413, pruned_loss=0.1723, over 4286570.36 frames. ], batch size: 263, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 20:30:35,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30300.0, ans=0.1 2023-06-17 20:30:59,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=30360.0, ans=0.125 2023-06-17 20:32:32,733 INFO [train.py:996] (0/4) Epoch 1, batch 5100, loss[loss=0.3537, simple_loss=0.3778, pruned_loss=0.1648, over 21648.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.4129, pruned_loss=0.1735, over 4289249.46 frames. ], batch size: 263, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 20:33:25,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.491e+02 3.812e+02 4.644e+02 6.399e+02 1.305e+03, threshold=9.287e+02, percent-clipped=10.0 2023-06-17 20:34:31,283 INFO [train.py:996] (0/4) Epoch 1, batch 5150, loss[loss=0.3959, simple_loss=0.4231, pruned_loss=0.1844, over 21864.00 frames. ], tot_loss[loss=0.3792, simple_loss=0.4116, pruned_loss=0.1734, over 4285988.29 frames. ], batch size: 371, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 20:34:45,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=30960.0, ans=0.0 2023-06-17 20:35:00,773 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0 2023-06-17 20:35:19,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=15.0 2023-06-17 20:36:03,302 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. 
limit=6.0 2023-06-17 20:36:28,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=31140.0, ans=0.125 2023-06-17 20:36:52,078 INFO [train.py:996] (0/4) Epoch 1, batch 5200, loss[loss=0.3937, simple_loss=0.458, pruned_loss=0.1647, over 21862.00 frames. ], tot_loss[loss=0.383, simple_loss=0.4151, pruned_loss=0.1755, over 4290748.74 frames. ], batch size: 316, lr: 4.08e-02, grad_scale: 32.0 2023-06-17 20:36:57,695 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-17 20:37:10,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=31200.0, ans=0.04949747468305833 2023-06-17 20:37:49,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=31260.0, ans=0.125 2023-06-17 20:37:49,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=31260.0, ans=0.125 2023-06-17 20:37:49,921 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.764e+02 3.911e+02 4.920e+02 6.306e+02 1.130e+03, threshold=9.840e+02, percent-clipped=5.0 2023-06-17 20:37:52,477 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.08 vs. limit=22.5 2023-06-17 20:37:52,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.62 vs. limit=15.0 2023-06-17 20:37:54,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=31320.0, ans=0.125 2023-06-17 20:38:17,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31380.0, ans=0.1 2023-06-17 20:38:21,227 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.68 vs. limit=15.0 2023-06-17 20:38:33,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=31440.0, ans=0.125 2023-06-17 20:39:01,688 INFO [train.py:996] (0/4) Epoch 1, batch 5250, loss[loss=0.3502, simple_loss=0.3856, pruned_loss=0.1574, over 21831.00 frames. ], tot_loss[loss=0.3789, simple_loss=0.4159, pruned_loss=0.1709, over 4288023.14 frames. ], batch size: 107, lr: 4.07e-02, grad_scale: 32.0 2023-06-17 20:39:21,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.99 vs. 
limit=5.0 2023-06-17 20:40:14,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=31620.0, ans=0.07 2023-06-17 20:40:17,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=31620.0, ans=0.003995652173913044 2023-06-17 20:41:01,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=31740.0, ans=0.003969565217391304 2023-06-17 20:41:18,342 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.12 vs. limit=10.0 2023-06-17 20:41:23,878 INFO [train.py:996] (0/4) Epoch 1, batch 5300, loss[loss=0.3832, simple_loss=0.4107, pruned_loss=0.1778, over 21857.00 frames. ], tot_loss[loss=0.3814, simple_loss=0.4169, pruned_loss=0.1729, over 4295057.41 frames. ], batch size: 351, lr: 4.07e-02, grad_scale: 32.0 2023-06-17 20:41:25,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=31800.0, ans=0.125 2023-06-17 20:42:02,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=31800.0, ans=0.0 2023-06-17 20:42:23,318 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-17 20:42:26,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.571e+02 4.372e+02 6.573e+02 1.564e+03, threshold=8.743e+02, percent-clipped=7.0 2023-06-17 20:42:27,476 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-17 20:42:58,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=31980.0, ans=0.125 2023-06-17 20:43:26,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=32040.0, ans=0.0 2023-06-17 20:43:47,264 INFO [train.py:996] (0/4) Epoch 1, batch 5350, loss[loss=0.3791, simple_loss=0.3988, pruned_loss=0.1797, over 21847.00 frames. ], tot_loss[loss=0.3837, simple_loss=0.4167, pruned_loss=0.1753, over 4294915.01 frames. ], batch size: 298, lr: 4.06e-02, grad_scale: 32.0 2023-06-17 20:44:45,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=32220.0, ans=0.025 2023-06-17 20:44:54,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=32220.0, ans=0.003865217391304348 2023-06-17 20:45:15,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=32280.0, ans=0.0 2023-06-17 20:45:15,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=32280.0, ans=0.125 2023-06-17 20:45:31,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=32340.0, ans=0.025 2023-06-17 20:46:12,770 INFO [train.py:996] (0/4) Epoch 1, batch 5400, loss[loss=0.4952, simple_loss=0.5355, pruned_loss=0.2275, over 21235.00 frames. 
], tot_loss[loss=0.3856, simple_loss=0.4166, pruned_loss=0.1773, over 4300780.66 frames. ], batch size: 548, lr: 4.05e-02, grad_scale: 32.0 2023-06-17 20:46:47,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=32460.0, ans=0.125 2023-06-17 20:46:52,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=32460.0, ans=0.125 2023-06-17 20:46:57,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 3.975e+02 4.705e+02 6.164e+02 1.546e+03, threshold=9.411e+02, percent-clipped=5.0 2023-06-17 20:47:26,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=32580.0, ans=0.125 2023-06-17 20:47:28,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=32580.0, ans=0.2 2023-06-17 20:47:36,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=32580.0, ans=0.125 2023-06-17 20:47:40,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=32580.0, ans=0.0 2023-06-17 20:47:45,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32640.0, ans=0.1 2023-06-17 20:48:16,836 INFO [train.py:996] (0/4) Epoch 1, batch 5450, loss[loss=0.338, simple_loss=0.4063, pruned_loss=0.1349, over 21374.00 frames. ], tot_loss[loss=0.3821, simple_loss=0.4152, pruned_loss=0.1745, over 4303618.71 frames. ], batch size: 194, lr: 4.05e-02, grad_scale: 32.0 2023-06-17 20:48:43,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=32700.0, ans=0.125 2023-06-17 20:49:05,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=32820.0, ans=0.035 2023-06-17 20:49:27,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=32820.0, ans=0.0 2023-06-17 20:50:39,975 INFO [train.py:996] (0/4) Epoch 1, batch 5500, loss[loss=0.3873, simple_loss=0.4619, pruned_loss=0.1564, over 21776.00 frames. ], tot_loss[loss=0.3758, simple_loss=0.414, pruned_loss=0.1688, over 4292745.78 frames. ], batch size: 351, lr: 4.04e-02, grad_scale: 32.0 2023-06-17 20:51:02,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.24 vs. limit=10.0 2023-06-17 20:51:30,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.468e+02 4.497e+02 5.642e+02 1.011e+03, threshold=8.995e+02, percent-clipped=2.0 2023-06-17 20:53:14,077 INFO [train.py:996] (0/4) Epoch 1, batch 5550, loss[loss=0.314, simple_loss=0.3817, pruned_loss=0.1232, over 21795.00 frames. ], tot_loss[loss=0.374, simple_loss=0.4154, pruned_loss=0.1663, over 4287660.98 frames. 
], batch size: 371, lr: 4.03e-02, grad_scale: 32.0 2023-06-17 20:53:52,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=33360.0, ans=0.0 2023-06-17 20:54:07,744 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.716e-03 2023-06-17 20:54:41,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=33480.0, ans=0.0 2023-06-17 20:55:09,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=33540.0, ans=0.95 2023-06-17 20:55:13,412 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-17 20:55:30,608 INFO [train.py:996] (0/4) Epoch 1, batch 5600, loss[loss=0.4284, simple_loss=0.4785, pruned_loss=0.1892, over 21836.00 frames. ], tot_loss[loss=0.3691, simple_loss=0.4136, pruned_loss=0.1623, over 4285925.73 frames. ], batch size: 371, lr: 4.03e-02, grad_scale: 32.0 2023-06-17 20:56:05,406 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.459e+02 4.877e+02 6.636e+02 1.371e+03, threshold=9.753e+02, percent-clipped=8.0 2023-06-17 20:56:08,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=33660.0, ans=0.125 2023-06-17 20:57:12,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=33840.0, ans=0.125 2023-06-17 20:57:49,975 INFO [train.py:996] (0/4) Epoch 1, batch 5650, loss[loss=0.3982, simple_loss=0.4186, pruned_loss=0.1889, over 20956.00 frames. ], tot_loss[loss=0.3732, simple_loss=0.416, pruned_loss=0.1652, over 4292128.50 frames. ], batch size: 607, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 20:58:06,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=33900.0, ans=0.125 2023-06-17 20:58:14,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-06-17 20:58:31,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=22.5 2023-06-17 20:59:38,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=34140.0, ans=0.2 2023-06-17 20:59:49,880 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-17 21:00:01,363 INFO [train.py:996] (0/4) Epoch 1, batch 5700, loss[loss=0.353, simple_loss=0.3922, pruned_loss=0.1569, over 21711.00 frames. ], tot_loss[loss=0.3748, simple_loss=0.416, pruned_loss=0.1668, over 4291508.81 frames. 
], batch size: 247, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 21:00:15,513 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:00:17,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=34200.0, ans=0.2 2023-06-17 21:00:51,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=34260.0, ans=0.1 2023-06-17 21:01:02,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.613e+02 3.406e+02 4.056e+02 5.524e+02 1.397e+03, threshold=8.113e+02, percent-clipped=5.0 2023-06-17 21:01:55,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=34440.0, ans=0.02 2023-06-17 21:01:56,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=34440.0, ans=0.125 2023-06-17 21:02:12,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=34440.0, ans=0.2 2023-06-17 21:02:26,270 INFO [train.py:996] (0/4) Epoch 1, batch 5750, loss[loss=0.3251, simple_loss=0.3775, pruned_loss=0.1363, over 21173.00 frames. ], tot_loss[loss=0.3666, simple_loss=0.4097, pruned_loss=0.1617, over 4286850.81 frames. ], batch size: 159, lr: 4.01e-02, grad_scale: 32.0 2023-06-17 21:02:28,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-17 21:02:47,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=34500.0, ans=0.125 2023-06-17 21:03:26,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=34620.0, ans=0.0 2023-06-17 21:03:27,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=34620.0, ans=0.003343478260869566 2023-06-17 21:04:06,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=34680.0, ans=0.125 2023-06-17 21:04:53,035 INFO [train.py:996] (0/4) Epoch 1, batch 5800, loss[loss=0.3558, simple_loss=0.4215, pruned_loss=0.1451, over 21792.00 frames. ], tot_loss[loss=0.3587, simple_loss=0.4045, pruned_loss=0.1565, over 4289478.37 frames. 
], batch size: 282, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 21:05:09,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=34800.0, ans=0.125 2023-06-17 21:05:23,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=34860.0, ans=0.2 2023-06-17 21:05:33,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.587e+02 3.596e+02 4.280e+02 5.879e+02 1.064e+03, threshold=8.560e+02, percent-clipped=6.0 2023-06-17 21:06:10,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=34980.0, ans=0.0 2023-06-17 21:06:23,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=34980.0, ans=0.025 2023-06-17 21:07:15,167 INFO [train.py:996] (0/4) Epoch 1, batch 5850, loss[loss=0.3846, simple_loss=0.4519, pruned_loss=0.1587, over 21188.00 frames. ], tot_loss[loss=0.3463, simple_loss=0.3976, pruned_loss=0.1475, over 4286464.92 frames. ], batch size: 548, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 21:07:27,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=35100.0, ans=0.125 2023-06-17 21:07:34,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=35160.0, ans=0.125 2023-06-17 21:07:37,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=35160.0, ans=0.035 2023-06-17 21:07:40,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.06 vs. limit=6.0 2023-06-17 21:09:28,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=35400.0, ans=0.125 2023-06-17 21:09:29,299 INFO [train.py:996] (0/4) Epoch 1, batch 5900, loss[loss=0.2376, simple_loss=0.3089, pruned_loss=0.08314, over 21284.00 frames. ], tot_loss[loss=0.3291, simple_loss=0.3848, pruned_loss=0.1367, over 4279360.81 frames. ], batch size: 176, lr: 3.99e-02, grad_scale: 32.0 2023-06-17 21:10:06,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 2.990e+02 3.626e+02 5.703e+02 1.926e+03, threshold=7.252e+02, percent-clipped=11.0 2023-06-17 21:10:32,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=35580.0, ans=0.0 2023-06-17 21:10:34,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=35580.0, ans=0.0 2023-06-17 21:10:42,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=35580.0, ans=0.025 2023-06-17 21:11:03,561 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-17 21:11:09,789 INFO [train.py:996] (0/4) Epoch 1, batch 5950, loss[loss=0.3956, simple_loss=0.42, pruned_loss=0.1856, over 15571.00 frames. ], tot_loss[loss=0.3419, simple_loss=0.3899, pruned_loss=0.147, over 4283662.99 frames. 
], batch size: 61, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 21:11:38,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=35700.0, ans=0.125 2023-06-17 21:12:18,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-17 21:12:47,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-06-17 21:13:16,710 INFO [train.py:996] (0/4) Epoch 1, batch 6000, loss[loss=0.3227, simple_loss=0.3473, pruned_loss=0.149, over 21411.00 frames. ], tot_loss[loss=0.3483, simple_loss=0.3895, pruned_loss=0.1535, over 4287140.45 frames. ], batch size: 195, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 21:13:16,712 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-17 21:14:09,294 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3443, simple_loss=0.428, pruned_loss=0.1303, over 1796401.00 frames. 2023-06-17 21:14:09,296 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23584MB 2023-06-17 21:14:40,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 3.654e+02 4.651e+02 5.653e+02 9.533e+02, threshold=9.302e+02, percent-clipped=10.0 2023-06-17 21:15:27,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=36180.0, ans=0.125 2023-06-17 21:16:08,863 INFO [train.py:996] (0/4) Epoch 1, batch 6050, loss[loss=0.2801, simple_loss=0.3216, pruned_loss=0.1193, over 21597.00 frames. ], tot_loss[loss=0.349, simple_loss=0.3873, pruned_loss=0.1554, over 4279450.26 frames. ], batch size: 263, lr: 3.97e-02, grad_scale: 32.0 2023-06-17 21:17:06,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=36420.0, ans=10.0 2023-06-17 21:17:21,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=36420.0, ans=0.05 2023-06-17 21:17:44,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36480.0, ans=0.1 2023-06-17 21:17:49,529 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-17 21:18:23,273 INFO [train.py:996] (0/4) Epoch 1, batch 6100, loss[loss=0.3534, simple_loss=0.3909, pruned_loss=0.158, over 21818.00 frames. ], tot_loss[loss=0.3496, simple_loss=0.388, pruned_loss=0.1556, over 4272695.88 frames. 
], batch size: 282, lr: 3.96e-02, grad_scale: 32.0 2023-06-17 21:19:00,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 3.419e+02 4.285e+02 5.583e+02 1.372e+03, threshold=8.569e+02, percent-clipped=6.0 2023-06-17 21:19:03,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=36720.0, ans=0.1 2023-06-17 21:19:10,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=36720.0, ans=0.0028869565217391306 2023-06-17 21:19:14,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=36720.0, ans=0.0 2023-06-17 21:19:32,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=36780.0, ans=0.125 2023-06-17 21:19:49,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=36780.0, ans=0.125 2023-06-17 21:20:17,369 INFO [train.py:996] (0/4) Epoch 1, batch 6150, loss[loss=0.3451, simple_loss=0.3856, pruned_loss=0.1522, over 21713.00 frames. ], tot_loss[loss=0.3542, simple_loss=0.3898, pruned_loss=0.1593, over 4272655.44 frames. ], batch size: 298, lr: 3.96e-02, grad_scale: 16.0 2023-06-17 21:20:41,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36900.0, ans=0.1 2023-06-17 21:21:10,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=37020.0, ans=0.1 2023-06-17 21:22:41,972 INFO [train.py:996] (0/4) Epoch 1, batch 6200, loss[loss=0.3272, simple_loss=0.38, pruned_loss=0.1372, over 21392.00 frames. ], tot_loss[loss=0.3549, simple_loss=0.392, pruned_loss=0.1589, over 4273782.57 frames. ], batch size: 211, lr: 3.95e-02, grad_scale: 16.0 2023-06-17 21:22:48,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=37200.0, ans=0.125 2023-06-17 21:22:49,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=37200.0, ans=0.125 2023-06-17 21:23:13,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.238e+02 4.206e+02 5.371e+02 1.012e+03, threshold=8.413e+02, percent-clipped=2.0 2023-06-17 21:23:33,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=37320.0, ans=0.0 2023-06-17 21:24:47,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=37440.0, ans=0.125 2023-06-17 21:24:54,229 INFO [train.py:996] (0/4) Epoch 1, batch 6250, loss[loss=0.4907, simple_loss=0.5211, pruned_loss=0.2302, over 19782.00 frames. ], tot_loss[loss=0.3577, simple_loss=0.3969, pruned_loss=0.1592, over 4269154.85 frames. 
], batch size: 703, lr: 3.94e-02, grad_scale: 16.0 2023-06-17 21:24:57,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=37500.0, ans=22.5 2023-06-17 21:25:27,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=37560.0, ans=0.0 2023-06-17 21:26:28,351 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:26:45,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.60 vs. limit=22.5 2023-06-17 21:26:56,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=37740.0, ans=0.07 2023-06-17 21:27:20,508 INFO [train.py:996] (0/4) Epoch 1, batch 6300, loss[loss=0.3405, simple_loss=0.3953, pruned_loss=0.1428, over 17000.00 frames. ], tot_loss[loss=0.3588, simple_loss=0.4013, pruned_loss=0.1582, over 4270239.29 frames. ], batch size: 60, lr: 3.94e-02, grad_scale: 16.0 2023-06-17 21:28:10,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.757e+02 4.825e+02 6.859e+02 1.465e+03, threshold=9.649e+02, percent-clipped=15.0 2023-06-17 21:28:21,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=37920.0, ans=0.1 2023-06-17 21:28:48,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=37980.0, ans=0.125 2023-06-17 21:29:27,497 INFO [train.py:996] (0/4) Epoch 1, batch 6350, loss[loss=0.4111, simple_loss=0.4313, pruned_loss=0.1954, over 21756.00 frames. ], tot_loss[loss=0.371, simple_loss=0.4093, pruned_loss=0.1663, over 4278556.79 frames. ], batch size: 298, lr: 3.93e-02, grad_scale: 16.0 2023-06-17 21:30:57,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=38280.0, ans=0.07 2023-06-17 21:31:11,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38340.0, ans=0.1 2023-06-17 21:31:13,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-17 21:31:46,188 INFO [train.py:996] (0/4) Epoch 1, batch 6400, loss[loss=0.4154, simple_loss=0.4412, pruned_loss=0.1948, over 21690.00 frames. ], tot_loss[loss=0.3816, simple_loss=0.4171, pruned_loss=0.173, over 4285604.78 frames. ], batch size: 351, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 21:32:20,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. limit=10.0 2023-06-17 21:32:35,824 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.736e+02 4.493e+02 6.013e+02 1.011e+03, threshold=8.985e+02, percent-clipped=1.0 2023-06-17 21:33:14,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38520.0, ans=0.1 2023-06-17 21:33:49,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.82 vs. 
limit=6.0 2023-06-17 21:34:04,666 INFO [train.py:996] (0/4) Epoch 1, batch 6450, loss[loss=0.3301, simple_loss=0.3645, pruned_loss=0.1478, over 21378.00 frames. ], tot_loss[loss=0.3796, simple_loss=0.4173, pruned_loss=0.1709, over 4284327.87 frames. ], batch size: 131, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 21:34:30,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=38760.0, ans=0.0 2023-06-17 21:34:50,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=38820.0, ans=0.125 2023-06-17 21:35:20,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=38820.0, ans=0.125 2023-06-17 21:35:33,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.48 vs. limit=22.5 2023-06-17 21:36:12,925 INFO [train.py:996] (0/4) Epoch 1, batch 6500, loss[loss=0.3135, simple_loss=0.3508, pruned_loss=0.1381, over 21509.00 frames. ], tot_loss[loss=0.3718, simple_loss=0.4073, pruned_loss=0.1681, over 4279750.76 frames. ], batch size: 230, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 21:36:45,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.75 vs. limit=22.5 2023-06-17 21:36:59,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.413e+02 4.587e+02 6.044e+02 1.414e+03, threshold=9.175e+02, percent-clipped=8.0 2023-06-17 21:38:35,543 INFO [train.py:996] (0/4) Epoch 1, batch 6550, loss[loss=0.3598, simple_loss=0.3879, pruned_loss=0.1658, over 21902.00 frames. ], tot_loss[loss=0.3699, simple_loss=0.4061, pruned_loss=0.1668, over 4279441.07 frames. ], batch size: 316, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 21:38:35,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=39300.0, ans=0.1 2023-06-17 21:39:40,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=39420.0, ans=0.125 2023-06-17 21:39:54,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-17 21:40:10,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=39540.0, ans=0.125 2023-06-17 21:40:13,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=39540.0, ans=0.125 2023-06-17 21:40:35,058 INFO [train.py:996] (0/4) Epoch 1, batch 6600, loss[loss=0.3253, simple_loss=0.3507, pruned_loss=0.1499, over 21444.00 frames. ], tot_loss[loss=0.3654, simple_loss=0.4008, pruned_loss=0.165, over 4271846.07 frames. ], batch size: 212, lr: 3.90e-02, grad_scale: 32.0 2023-06-17 21:40:40,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.44 vs. 
limit=15.0 2023-06-17 21:41:11,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39660.0, ans=0.1 2023-06-17 21:41:23,084 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.711e+02 4.295e+02 5.281e+02 1.119e+03, threshold=8.590e+02, percent-clipped=2.0 2023-06-17 21:41:44,024 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:41:55,798 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.796e-03 2023-06-17 21:42:34,154 INFO [train.py:996] (0/4) Epoch 1, batch 6650, loss[loss=0.4121, simple_loss=0.4007, pruned_loss=0.2118, over 21396.00 frames. ], tot_loss[loss=0.3569, simple_loss=0.3925, pruned_loss=0.1606, over 4275522.93 frames. ], batch size: 509, lr: 3.89e-02, grad_scale: 32.0 2023-06-17 21:42:34,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=39900.0, ans=0.125 2023-06-17 21:43:16,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=39960.0, ans=0.125 2023-06-17 21:43:31,086 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.02 vs. limit=15.0 2023-06-17 21:43:32,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0 2023-06-17 21:43:34,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=40020.0, ans=0.0 2023-06-17 21:44:04,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=40080.0, ans=0.125 2023-06-17 21:44:49,223 INFO [train.py:996] (0/4) Epoch 1, batch 6700, loss[loss=0.3144, simple_loss=0.348, pruned_loss=0.1404, over 21876.00 frames. ], tot_loss[loss=0.3524, simple_loss=0.386, pruned_loss=0.1593, over 4277167.99 frames. ], batch size: 107, lr: 3.89e-02, grad_scale: 32.0 2023-06-17 21:45:25,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=40260.0, ans=0.002117391304347826 2023-06-17 21:45:41,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.343e+02 3.533e+02 4.521e+02 6.052e+02 1.154e+03, threshold=9.041e+02, percent-clipped=5.0 2023-06-17 21:45:50,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40320.0, ans=0.1 2023-06-17 21:46:10,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=40380.0, ans=0.125 2023-06-17 21:46:31,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=40440.0, ans=0.1 2023-06-17 21:46:42,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=40440.0, ans=0.2 2023-06-17 21:46:43,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.75 vs. 
limit=12.0 2023-06-17 21:47:11,741 INFO [train.py:996] (0/4) Epoch 1, batch 6750, loss[loss=0.3712, simple_loss=0.3936, pruned_loss=0.1744, over 21814.00 frames. ], tot_loss[loss=0.3512, simple_loss=0.3838, pruned_loss=0.1593, over 4273898.99 frames. ], batch size: 282, lr: 3.88e-02, grad_scale: 32.0 2023-06-17 21:47:17,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-17 21:48:48,991 INFO [train.py:996] (0/4) Epoch 1, batch 6800, loss[loss=0.3386, simple_loss=0.38, pruned_loss=0.1486, over 21876.00 frames. ], tot_loss[loss=0.3552, simple_loss=0.3866, pruned_loss=0.1619, over 4267941.33 frames. ], batch size: 124, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 21:48:52,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=40800.0, ans=0.002 2023-06-17 21:49:27,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.53 vs. limit=15.0 2023-06-17 21:49:38,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.364e+02 3.387e+02 4.190e+02 5.566e+02 1.112e+03, threshold=8.380e+02, percent-clipped=6.0 2023-06-17 21:49:41,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=40920.0, ans=0.125 2023-06-17 21:49:46,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=40920.0, ans=0.035 2023-06-17 21:50:04,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40980.0, ans=0.1 2023-06-17 21:50:31,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=41040.0, ans=0.0019478260869565216 2023-06-17 21:50:36,606 INFO [train.py:996] (0/4) Epoch 1, batch 6850, loss[loss=0.3398, simple_loss=0.3639, pruned_loss=0.1578, over 21454.00 frames. ], tot_loss[loss=0.3563, simple_loss=0.3849, pruned_loss=0.1638, over 4276146.63 frames. ], batch size: 212, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 21:50:45,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=41100.0, ans=0.0019347826086956524 2023-06-17 21:51:06,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=41160.0, ans=0.0 2023-06-17 21:51:07,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=41160.0, ans=0.125 2023-06-17 21:51:10,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=41220.0, ans=0.0 2023-06-17 21:51:28,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=41220.0, ans=0.125 2023-06-17 21:51:33,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=41220.0, ans=0.0 2023-06-17 21:52:33,797 INFO [train.py:996] (0/4) Epoch 1, batch 6900, loss[loss=0.3547, simple_loss=0.3892, pruned_loss=0.1601, over 21578.00 frames. ], tot_loss[loss=0.3561, simple_loss=0.3851, pruned_loss=0.1636, over 4281110.90 frames. 
], batch size: 548, lr: 3.86e-02, grad_scale: 32.0 2023-06-17 21:52:35,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=41400.0, ans=10.0 2023-06-17 21:52:59,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=41400.0, ans=0.125 2023-06-17 21:53:21,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=15.0 2023-06-17 21:53:31,730 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 3.488e+02 4.127e+02 5.234e+02 1.332e+03, threshold=8.254e+02, percent-clipped=6.0 2023-06-17 21:53:53,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.32 vs. limit=22.5 2023-06-17 21:54:58,879 INFO [train.py:996] (0/4) Epoch 1, batch 6950, loss[loss=0.3535, simple_loss=0.3871, pruned_loss=0.1599, over 21634.00 frames. ], tot_loss[loss=0.3534, simple_loss=0.3881, pruned_loss=0.1593, over 4287615.93 frames. ], batch size: 263, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 21:55:42,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41760.0, ans=0.1 2023-06-17 21:57:23,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-17 21:57:23,704 INFO [train.py:996] (0/4) Epoch 1, batch 7000, loss[loss=0.359, simple_loss=0.3829, pruned_loss=0.1675, over 21866.00 frames. ], tot_loss[loss=0.3619, simple_loss=0.3933, pruned_loss=0.1652, over 4288065.42 frames. ], batch size: 98, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 21:57:39,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42000.0, ans=0.1 2023-06-17 21:58:00,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 4.182e+02 5.502e+02 7.553e+02 1.291e+03, threshold=1.100e+03, percent-clipped=22.0 2023-06-17 21:59:33,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=42300.0, ans=0.025 2023-06-17 21:59:34,221 INFO [train.py:996] (0/4) Epoch 1, batch 7050, loss[loss=0.3099, simple_loss=0.3556, pruned_loss=0.1321, over 21367.00 frames. ], tot_loss[loss=0.3567, simple_loss=0.3891, pruned_loss=0.1622, over 4282680.31 frames. ], batch size: 176, lr: 3.84e-02, grad_scale: 32.0 2023-06-17 22:00:58,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=42480.0, ans=0.2 2023-06-17 22:01:01,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=42480.0, ans=10.0 2023-06-17 22:01:20,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=42480.0, ans=0.125 2023-06-17 22:01:55,178 INFO [train.py:996] (0/4) Epoch 1, batch 7100, loss[loss=0.3694, simple_loss=0.4146, pruned_loss=0.1621, over 21815.00 frames. ], tot_loss[loss=0.3631, simple_loss=0.3964, pruned_loss=0.1649, over 4283429.48 frames. 
], batch size: 371, lr: 3.83e-02, grad_scale: 32.0 2023-06-17 22:02:02,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=42600.0, ans=0.125 2023-06-17 22:02:53,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 3.347e+02 4.129e+02 5.601e+02 1.207e+03, threshold=8.258e+02, percent-clipped=3.0 2023-06-17 22:03:08,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=42720.0, ans=0.1 2023-06-17 22:03:13,540 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-17 22:03:39,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=42780.0, ans=0.2 2023-06-17 22:03:42,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=42780.0, ans=0.1 2023-06-17 22:04:11,656 INFO [train.py:996] (0/4) Epoch 1, batch 7150, loss[loss=0.3754, simple_loss=0.4078, pruned_loss=0.1715, over 21636.00 frames. ], tot_loss[loss=0.352, simple_loss=0.3893, pruned_loss=0.1574, over 4276460.17 frames. ], batch size: 263, lr: 3.83e-02, grad_scale: 32.0 2023-06-17 22:04:12,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42900.0, ans=0.1 2023-06-17 22:04:52,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=42960.0, ans=0.04949747468305833 2023-06-17 22:05:07,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=43020.0, ans=0.2 2023-06-17 22:06:15,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=43140.0, ans=0.2 2023-06-17 22:06:33,818 INFO [train.py:996] (0/4) Epoch 1, batch 7200, loss[loss=0.33, simple_loss=0.3613, pruned_loss=0.1494, over 21624.00 frames. ], tot_loss[loss=0.3562, simple_loss=0.392, pruned_loss=0.1602, over 4268072.41 frames. ], batch size: 298, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 22:07:03,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43260.0, ans=0.1 2023-06-17 22:07:14,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=43260.0, ans=0.125 2023-06-17 22:07:18,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 3.282e+02 4.227e+02 5.672e+02 1.166e+03, threshold=8.454e+02, percent-clipped=6.0 2023-06-17 22:07:37,901 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:08:31,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=43440.0, ans=0.0014260869565217386 2023-06-17 22:08:49,366 INFO [train.py:996] (0/4) Epoch 1, batch 7250, loss[loss=0.3824, simple_loss=0.3838, pruned_loss=0.1906, over 21418.00 frames. ], tot_loss[loss=0.3541, simple_loss=0.3871, pruned_loss=0.1606, over 4266515.60 frames. ], batch size: 475, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 22:09:55,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.51 vs. 
limit=22.5 2023-06-17 22:10:15,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=43680.0, ans=0.035 2023-06-17 22:10:23,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.52 vs. limit=8.0 2023-06-17 22:10:47,937 INFO [train.py:996] (0/4) Epoch 1, batch 7300, loss[loss=0.3092, simple_loss=0.3353, pruned_loss=0.1415, over 21355.00 frames. ], tot_loss[loss=0.3485, simple_loss=0.3796, pruned_loss=0.1587, over 4266061.15 frames. ], batch size: 144, lr: 3.81e-02, grad_scale: 32.0 2023-06-17 22:11:32,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.61 vs. limit=22.5 2023-06-17 22:11:33,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=43860.0, ans=0.0 2023-06-17 22:11:39,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=43860.0, ans=0.125 2023-06-17 22:11:54,765 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.416e+02 3.511e+02 4.213e+02 5.491e+02 1.019e+03, threshold=8.426e+02, percent-clipped=2.0 2023-06-17 22:12:04,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=43920.0, ans=0.05 2023-06-17 22:12:05,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=43920.0, ans=0.0 2023-06-17 22:12:59,888 INFO [train.py:996] (0/4) Epoch 1, batch 7350, loss[loss=0.3908, simple_loss=0.4233, pruned_loss=0.1792, over 21299.00 frames. ], tot_loss[loss=0.349, simple_loss=0.3776, pruned_loss=0.1602, over 4266231.26 frames. ], batch size: 143, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 22:13:00,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=44100.0, ans=0.001282608695652174 2023-06-17 22:13:38,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=44220.0, ans=0.2 2023-06-17 22:13:48,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=44220.0, ans=0.125 2023-06-17 22:14:02,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=44280.0, ans=0.0012434782608695648 2023-06-17 22:14:28,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=44280.0, ans=0.0 2023-06-17 22:14:59,905 INFO [train.py:996] (0/4) Epoch 1, batch 7400, loss[loss=0.3352, simple_loss=0.3798, pruned_loss=0.1453, over 21615.00 frames. ], tot_loss[loss=0.3581, simple_loss=0.3881, pruned_loss=0.164, over 4263036.32 frames. 
], batch size: 247, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 22:15:47,313 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.591e+02 3.927e+02 4.881e+02 6.398e+02 1.158e+03, threshold=9.762e+02, percent-clipped=7.0 2023-06-17 22:15:47,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=44460.0, ans=0.125 2023-06-17 22:16:20,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=44580.0, ans=0.0 2023-06-17 22:17:04,094 INFO [train.py:996] (0/4) Epoch 1, batch 7450, loss[loss=0.3745, simple_loss=0.3935, pruned_loss=0.1777, over 15810.00 frames. ], tot_loss[loss=0.3541, simple_loss=0.3845, pruned_loss=0.1619, over 4262036.59 frames. ], batch size: 64, lr: 3.79e-02, grad_scale: 32.0 2023-06-17 22:17:38,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=44760.0, ans=0.125 2023-06-17 22:18:21,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=44820.0, ans=0.2 2023-06-17 22:18:54,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=44940.0, ans=0.0011000000000000003 2023-06-17 22:19:09,755 INFO [train.py:996] (0/4) Epoch 1, batch 7500, loss[loss=0.3513, simple_loss=0.418, pruned_loss=0.1423, over 21504.00 frames. ], tot_loss[loss=0.3592, simple_loss=0.3899, pruned_loss=0.1642, over 4266739.91 frames. ], batch size: 194, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 22:20:07,708 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.630e+02 4.462e+02 5.942e+02 1.492e+03, threshold=8.924e+02, percent-clipped=4.0 2023-06-17 22:20:13,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-17 22:21:01,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=45240.0, ans=0.125 2023-06-17 22:21:01,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=45240.0, ans=0.125 2023-06-17 22:21:06,590 INFO [train.py:996] (0/4) Epoch 1, batch 7550, loss[loss=0.3332, simple_loss=0.3767, pruned_loss=0.1449, over 21902.00 frames. ], tot_loss[loss=0.3576, simple_loss=0.3953, pruned_loss=0.1599, over 4261275.18 frames. ], batch size: 98, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 22:21:08,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45300.0, ans=0.1 2023-06-17 22:21:29,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=45360.0, ans=10.0 2023-06-17 22:21:32,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45360.0, ans=0.1 2023-06-17 22:21:57,955 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-17 22:22:04,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. 
limit=15.0 2023-06-17 22:22:14,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=45480.0, ans=0.125 2023-06-17 22:22:23,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=45540.0, ans=0.1 2023-06-17 22:22:41,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.39 vs. limit=5.0 2023-06-17 22:22:43,402 INFO [train.py:996] (0/4) Epoch 1, batch 7600, loss[loss=0.3553, simple_loss=0.3804, pruned_loss=0.1651, over 21743.00 frames. ], tot_loss[loss=0.356, simple_loss=0.395, pruned_loss=0.1585, over 4266895.55 frames. ], batch size: 230, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 22:23:15,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.356e+02 4.120e+02 5.350e+02 1.313e+03, threshold=8.240e+02, percent-clipped=1.0 2023-06-17 22:23:49,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=45720.0, ans=0.125 2023-06-17 22:24:44,507 INFO [train.py:996] (0/4) Epoch 1, batch 7650, loss[loss=0.3966, simple_loss=0.4369, pruned_loss=0.1782, over 21866.00 frames. ], tot_loss[loss=0.3598, simple_loss=0.3949, pruned_loss=0.1624, over 4276655.04 frames. ], batch size: 107, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 22:24:48,361 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=15.0 2023-06-17 22:25:11,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45960.0, ans=0.1 2023-06-17 22:25:46,730 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.15 vs. limit=22.5 2023-06-17 22:25:59,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=46020.0, ans=0.125 2023-06-17 22:26:11,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=46080.0, ans=0.125 2023-06-17 22:26:13,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=46080.0, ans=0.0 2023-06-17 22:26:19,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=46080.0, ans=0.125 2023-06-17 22:26:33,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=46140.0, ans=10.0 2023-06-17 22:26:42,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=46140.0, ans=0.0008391304347826079 2023-06-17 22:27:09,905 INFO [train.py:996] (0/4) Epoch 1, batch 7700, loss[loss=0.3806, simple_loss=0.4189, pruned_loss=0.1712, over 21641.00 frames. ], tot_loss[loss=0.3674, simple_loss=0.4011, pruned_loss=0.1668, over 4280969.59 frames. 
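A consistency check on the loss fields in these records: past warm-up, tot_loss is always 0.5 * simple_loss + pruned_loss, i.e. the pruned-transducer objective with the configured simple_loss_scale=0.5 and a pruned-loss scale that has ramped up to 1.0. For the batch 7700 totals above: 0.5 * 0.4011 + 0.1668 = 0.3674, exactly the logged tot_loss; the per-batch figures satisfy the same identity (0.5 * 0.4189 + 0.1712 ≈ 0.3806), as does the validation record at batch 9000 further down (0.5 * 0.4116 + 0.1164 = 0.3222).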
], batch size: 263, lr: 3.76e-02, grad_scale: 32.0 2023-06-17 22:27:38,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=46260.0, ans=0.0 2023-06-17 22:28:05,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=46260.0, ans=0.0 2023-06-17 22:28:07,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.716e+02 4.576e+02 5.497e+02 9.392e+02, threshold=9.152e+02, percent-clipped=2.0 2023-06-17 22:28:21,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46320.0, ans=0.1 2023-06-17 22:29:09,552 INFO [train.py:996] (0/4) Epoch 1, batch 7750, loss[loss=0.3684, simple_loss=0.4327, pruned_loss=0.1521, over 21624.00 frames. ], tot_loss[loss=0.3737, simple_loss=0.4095, pruned_loss=0.169, over 4278622.59 frames. ], batch size: 230, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 22:29:48,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2023-06-17 22:30:12,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=46620.0, ans=0.125 2023-06-17 22:30:45,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=46680.0, ans=0.125 2023-06-17 22:31:14,887 INFO [train.py:996] (0/4) Epoch 1, batch 7800, loss[loss=0.3245, simple_loss=0.3666, pruned_loss=0.1412, over 21677.00 frames. ], tot_loss[loss=0.3705, simple_loss=0.4075, pruned_loss=0.1668, over 4276620.41 frames. ], batch size: 263, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 22:31:18,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=46800.0, ans=0.125 2023-06-17 22:32:11,162 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.644e+02 4.604e+02 5.812e+02 1.073e+03, threshold=9.208e+02, percent-clipped=1.0 2023-06-17 22:32:43,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=46980.0, ans=0.125 2023-06-17 22:33:00,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=47040.0, ans=0.0006434782608695649 2023-06-17 22:33:10,957 INFO [train.py:996] (0/4) Epoch 1, batch 7850, loss[loss=0.3239, simple_loss=0.354, pruned_loss=0.1469, over 21671.00 frames. ], tot_loss[loss=0.3615, simple_loss=0.3966, pruned_loss=0.1632, over 4266425.86 frames. ], batch size: 283, lr: 3.74e-02, grad_scale: 32.0 2023-06-17 22:33:59,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=47220.0, ans=0.2 2023-06-17 22:34:28,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=47340.0, ans=0.125 2023-06-17 22:34:37,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=47340.0, ans=0.1 2023-06-17 22:34:45,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=47340.0, ans=0.125 2023-06-17 22:34:50,979 INFO [train.py:996] (0/4) Epoch 1, batch 7900, loss[loss=0.2975, simple_loss=0.344, pruned_loss=0.1255, over 21612.00 frames. 
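The ScheduledFloat records are piecewise-linear schedules of batch_count: each "ans=" value is just the schedule evaluated at the current count. The ff2_skip_rate values above decay at a constant ~2.17e-7 per batch across all layers and extrapolate to zero at batch_count 50000 (ans=0.0011 at 44940, ans=0.000643 at 47040), consistent with knots of roughly (4000, 0.01) -> (50000, 0.0); the knots, and the helper below, are inferred from the logged values rather than copied from icefall's scaling.py. A minimal sketch:

    import bisect

    class PiecewiseLinear:
        # A batch-count-indexed schedule in the spirit of icefall's ScheduledFloat.
        def __init__(self, *knots):
            self.xs = [x for x, _ in knots]   # batch counts, ascending
            self.ys = [y for _, y in knots]   # values at those counts

        def __call__(self, batch_count):
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]
            i = bisect.bisect_right(self.xs, batch_count) - 1
            t = (batch_count - self.xs[i]) / (self.xs[i + 1] - self.xs[i])
            return self.ys[i] + t * (self.ys[i + 1] - self.ys[i])

    # Inferred ff2_skip_rate schedule; reproduces the values in the records above:
    ff2_skip_rate = PiecewiseLinear((0.0, 0.1), (4000.0, 0.01), (50000.0, 0.0))
    assert abs(ff2_skip_rate(44940.0) - 0.0011) < 1e-9
    assert abs(ff2_skip_rate(47040.0) - 0.000643478) < 1e-6

The Whitening limits work the same way: they are themselves scheduled, which is why a literal whitening_limit ScheduledFloat record shows up later in the log (ans=12.0 at batch_count 53520).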
], tot_loss[loss=0.3572, simple_loss=0.3911, pruned_loss=0.1616, over 4273409.60 frames. ], batch size: 230, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 22:35:25,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=47460.0, ans=0.125 2023-06-17 22:35:28,921 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.540e+02 3.542e+02 4.364e+02 5.493e+02 1.086e+03, threshold=8.728e+02, percent-clipped=7.0 2023-06-17 22:36:56,702 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.73 vs. limit=22.5 2023-06-17 22:37:15,570 INFO [train.py:996] (0/4) Epoch 1, batch 7950, loss[loss=0.3644, simple_loss=0.4082, pruned_loss=0.1603, over 21245.00 frames. ], tot_loss[loss=0.3623, simple_loss=0.3991, pruned_loss=0.1628, over 4270241.28 frames. ], batch size: 143, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 22:37:31,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.06 vs. limit=6.0 2023-06-17 22:39:25,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=47940.0, ans=0.125 2023-06-17 22:39:37,984 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-8000.pt 2023-06-17 22:39:41,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=48000.0, ans=0.09899494936611666 2023-06-17 22:39:42,793 INFO [train.py:996] (0/4) Epoch 1, batch 8000, loss[loss=0.4209, simple_loss=0.4752, pruned_loss=0.1833, over 21635.00 frames. ], tot_loss[loss=0.3707, simple_loss=0.4053, pruned_loss=0.1681, over 4268168.33 frames. ], batch size: 414, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 22:39:59,094 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.77 vs. limit=22.5 2023-06-17 22:40:17,330 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 4.071e+02 5.066e+02 6.362e+02 1.546e+03, threshold=1.013e+03, percent-clipped=8.0 2023-06-17 22:41:03,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=48180.0, ans=10.0 2023-06-17 22:41:51,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=48240.0, ans=0.2 2023-06-17 22:42:17,230 INFO [train.py:996] (0/4) Epoch 1, batch 8050, loss[loss=0.4682, simple_loss=0.5039, pruned_loss=0.2163, over 21521.00 frames. ], tot_loss[loss=0.3705, simple_loss=0.4054, pruned_loss=0.1678, over 4261479.47 frames. ], batch size: 471, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 22:42:22,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=48300.0, ans=0.1 2023-06-17 22:42:25,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=48300.0, ans=0.0 2023-06-17 22:42:44,826 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:43:15,542 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.16 vs. 
limit=15.0 2023-06-17 22:43:29,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=48480.0, ans=0.125 2023-06-17 22:44:07,431 INFO [train.py:996] (0/4) Epoch 1, batch 8100, loss[loss=0.3319, simple_loss=0.3598, pruned_loss=0.152, over 21579.00 frames. ], tot_loss[loss=0.3703, simple_loss=0.406, pruned_loss=0.1673, over 4267138.15 frames. ], batch size: 195, lr: 3.71e-02, grad_scale: 32.0 2023-06-17 22:44:16,459 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.70 vs. limit=15.0 2023-06-17 22:45:13,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.843e+02 3.716e+02 4.690e+02 6.352e+02 1.621e+03, threshold=9.381e+02, percent-clipped=6.0 2023-06-17 22:45:18,397 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:45:45,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=48780.0, ans=0.125 2023-06-17 22:46:11,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48840.0, ans=0.1 2023-06-17 22:46:45,262 INFO [train.py:996] (0/4) Epoch 1, batch 8150, loss[loss=0.2627, simple_loss=0.3123, pruned_loss=0.1065, over 21201.00 frames. ], tot_loss[loss=0.3695, simple_loss=0.4072, pruned_loss=0.1659, over 4265223.38 frames. ], batch size: 143, lr: 3.70e-02, grad_scale: 32.0 2023-06-17 22:46:46,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=48900.0, ans=0.2 2023-06-17 22:47:00,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-17 22:47:25,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.73 vs. limit=6.0 2023-06-17 22:47:45,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=49020.0, ans=0.09899494936611666 2023-06-17 22:48:34,256 INFO [train.py:996] (0/4) Epoch 1, batch 8200, loss[loss=0.3091, simple_loss=0.3393, pruned_loss=0.1395, over 21406.00 frames. ], tot_loss[loss=0.3615, simple_loss=0.3999, pruned_loss=0.1616, over 4268226.85 frames. ], batch size: 212, lr: 3.70e-02, grad_scale: 32.0 2023-06-17 22:49:06,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=49260.0, ans=0.125 2023-06-17 22:49:19,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.379e+02 3.766e+02 4.530e+02 5.768e+02 1.043e+03, threshold=9.060e+02, percent-clipped=2.0 2023-06-17 22:50:11,968 INFO [train.py:996] (0/4) Epoch 1, batch 8250, loss[loss=0.4391, simple_loss=0.471, pruned_loss=0.2036, over 21531.00 frames. ], tot_loss[loss=0.3634, simple_loss=0.4007, pruned_loss=0.163, over 4273817.24 frames. 
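The 22:39:37 checkpoint record above lines up with the configured save_every_n=4000: batch-level checkpoints land at multiples of 4000 batches and are named after batch_idx_train, hence checkpoint-8000.pt as the second one. A small sketch of that naming rule (the helper is invented for illustration, not icefall's API):

    from pathlib import Path
    from typing import Optional

    def checkpoint_path(exp_dir: Path, batch_idx_train: int,
                        save_every_n: int) -> Optional[Path]:
        # Return the save path when a batch-level checkpoint is due, else None.
        if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
            return None
        return exp_dir / f"checkpoint-{batch_idx_train}.pt"

    p = checkpoint_path(Path("zipformer/exp_L_small"), 8000, 4000)
    assert p is not None and p.as_posix() == "zipformer/exp_L_small/checkpoint-8000.pt"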
], batch size: 471, lr: 3.69e-02, grad_scale: 32.0 2023-06-17 22:51:11,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=49620.0, ans=0.025 2023-06-17 22:51:11,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=49620.0, ans=0.125 2023-06-17 22:51:31,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2023-06-17 22:51:55,149 INFO [train.py:996] (0/4) Epoch 1, batch 8300, loss[loss=0.3858, simple_loss=0.4274, pruned_loss=0.1721, over 21653.00 frames. ], tot_loss[loss=0.357, simple_loss=0.3974, pruned_loss=0.1582, over 4273020.82 frames. ], batch size: 414, lr: 3.68e-02, grad_scale: 32.0 2023-06-17 22:52:34,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 3.435e+02 4.224e+02 5.428e+02 8.537e+02, threshold=8.449e+02, percent-clipped=0.0 2023-06-17 22:53:05,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=49980.0, ans=0.125 2023-06-17 22:53:13,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=50040.0, ans=0.125 2023-06-17 22:53:32,140 INFO [train.py:996] (0/4) Epoch 1, batch 8350, loss[loss=0.4361, simple_loss=0.4577, pruned_loss=0.2072, over 20660.00 frames. ], tot_loss[loss=0.352, simple_loss=0.3945, pruned_loss=0.1547, over 4266727.01 frames. ], batch size: 607, lr: 3.68e-02, grad_scale: 32.0 2023-06-17 22:54:19,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=50220.0, ans=0.0 2023-06-17 22:54:21,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=15.0 2023-06-17 22:54:28,482 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:55:15,326 INFO [train.py:996] (0/4) Epoch 1, batch 8400, loss[loss=0.3104, simple_loss=0.3784, pruned_loss=0.1212, over 21728.00 frames. ], tot_loss[loss=0.3446, simple_loss=0.3896, pruned_loss=0.1498, over 4265322.32 frames. ], batch size: 351, lr: 3.67e-02, grad_scale: 32.0 2023-06-17 22:56:05,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 4.068e+02 5.029e+02 6.288e+02 1.067e+03, threshold=1.006e+03, percent-clipped=6.0 2023-06-17 22:57:03,215 INFO [train.py:996] (0/4) Epoch 1, batch 8450, loss[loss=0.3311, simple_loss=0.3716, pruned_loss=0.1453, over 21698.00 frames. ], tot_loss[loss=0.3469, simple_loss=0.3906, pruned_loss=0.1516, over 4274139.62 frames. 
], batch size: 263, lr: 3.67e-02, grad_scale: 32.0 2023-06-17 22:57:19,549 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:57:51,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=50820.0, ans=0.04949747468305833 2023-06-17 22:58:19,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=50880.0, ans=0.125 2023-06-17 22:58:47,893 INFO [train.py:996] (0/4) Epoch 1, batch 8500, loss[loss=0.3356, simple_loss=0.3555, pruned_loss=0.1579, over 21264.00 frames. ], tot_loss[loss=0.3482, simple_loss=0.3879, pruned_loss=0.1542, over 4280167.84 frames. ], batch size: 548, lr: 3.66e-02, grad_scale: 32.0 2023-06-17 22:59:04,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=51000.0, ans=0.125 2023-06-17 22:59:15,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=51060.0, ans=0.2 2023-06-17 22:59:27,733 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.628e+02 4.473e+02 5.603e+02 9.273e+02, threshold=8.945e+02, percent-clipped=0.0 2023-06-17 22:59:37,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=51120.0, ans=0.0 2023-06-17 23:00:03,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.63 vs. limit=22.5 2023-06-17 23:00:31,208 INFO [train.py:996] (0/4) Epoch 1, batch 8550, loss[loss=0.3367, simple_loss=0.356, pruned_loss=0.1587, over 21496.00 frames. ], tot_loss[loss=0.3572, simple_loss=0.3956, pruned_loss=0.1594, over 4268776.17 frames. ], batch size: 195, lr: 3.65e-02, grad_scale: 32.0 2023-06-17 23:01:21,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=51420.0, ans=0.125 2023-06-17 23:02:20,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=51540.0, ans=0.0 2023-06-17 23:02:23,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=12.0 2023-06-17 23:02:29,147 INFO [train.py:996] (0/4) Epoch 1, batch 8600, loss[loss=0.3757, simple_loss=0.4359, pruned_loss=0.1577, over 20980.00 frames. ], tot_loss[loss=0.3668, simple_loss=0.4063, pruned_loss=0.1636, over 4267879.14 frames. ], batch size: 607, lr: 3.65e-02, grad_scale: 32.0 2023-06-17 23:02:48,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.22 vs. limit=6.0 2023-06-17 23:03:10,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.658e+02 3.952e+02 4.835e+02 6.888e+02 1.478e+03, threshold=9.670e+02, percent-clipped=13.0 2023-06-17 23:04:14,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=51840.0, ans=0.125 2023-06-17 23:04:29,692 INFO [train.py:996] (0/4) Epoch 1, batch 8650, loss[loss=0.2889, simple_loss=0.3489, pruned_loss=0.1145, over 21765.00 frames. ], tot_loss[loss=0.3721, simple_loss=0.4132, pruned_loss=0.1655, over 4267254.53 frames. 
], batch size: 124, lr: 3.64e-02, grad_scale: 32.0 2023-06-17 23:04:50,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=51960.0, ans=0.0 2023-06-17 23:04:53,960 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.24 vs. limit=10.0 2023-06-17 23:06:19,682 INFO [train.py:996] (0/4) Epoch 1, batch 8700, loss[loss=0.3186, simple_loss=0.35, pruned_loss=0.1436, over 21755.00 frames. ], tot_loss[loss=0.3607, simple_loss=0.404, pruned_loss=0.1587, over 4262532.08 frames. ], batch size: 124, lr: 3.64e-02, grad_scale: 32.0 2023-06-17 23:06:40,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=52260.0, ans=0.0 2023-06-17 23:06:59,824 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 3.304e+02 4.220e+02 5.103e+02 8.221e+02, threshold=8.441e+02, percent-clipped=0.0 2023-06-17 23:07:07,167 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.17 vs. limit=15.0 2023-06-17 23:07:43,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=52440.0, ans=10.0 2023-06-17 23:07:58,798 INFO [train.py:996] (0/4) Epoch 1, batch 8750, loss[loss=0.337, simple_loss=0.3681, pruned_loss=0.153, over 21643.00 frames. ], tot_loss[loss=0.3611, simple_loss=0.4005, pruned_loss=0.1609, over 4262750.34 frames. ], batch size: 230, lr: 3.63e-02, grad_scale: 32.0 2023-06-17 23:08:16,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=52500.0, ans=0.125 2023-06-17 23:09:54,981 INFO [train.py:996] (0/4) Epoch 1, batch 8800, loss[loss=0.4609, simple_loss=0.4813, pruned_loss=0.2202, over 21593.00 frames. ], tot_loss[loss=0.368, simple_loss=0.4072, pruned_loss=0.1644, over 4264244.52 frames. ], batch size: 414, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 23:10:41,777 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:10:45,678 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 4.248e+02 5.624e+02 7.492e+02 1.328e+03, threshold=1.125e+03, percent-clipped=14.0 2023-06-17 23:11:08,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=52980.0, ans=0.0 2023-06-17 23:11:24,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=52980.0, ans=0.2 2023-06-17 23:11:59,494 INFO [train.py:996] (0/4) Epoch 1, batch 8850, loss[loss=0.4969, simple_loss=0.5945, pruned_loss=0.1996, over 20798.00 frames. ], tot_loss[loss=0.3769, simple_loss=0.4169, pruned_loss=0.1685, over 4265526.77 frames. ], batch size: 607, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 23:13:38,033 INFO [train.py:996] (0/4) Epoch 1, batch 8900, loss[loss=0.3561, simple_loss=0.3964, pruned_loss=0.1579, over 21521.00 frames. ], tot_loss[loss=0.3702, simple_loss=0.4087, pruned_loss=0.1659, over 4265693.11 frames. ], batch size: 389, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 23:13:55,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.88 vs. 
limit=15.0 2023-06-17 23:13:58,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=53460.0, ans=0.125 2023-06-17 23:14:12,503 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.567e+02 3.765e+02 4.627e+02 5.813e+02 9.173e+02, threshold=9.253e+02, percent-clipped=0.0 2023-06-17 23:14:39,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=53520.0, ans=12.0 2023-06-17 23:15:26,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=53640.0, ans=0.125 2023-06-17 23:15:30,452 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-17 23:15:31,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=53640.0, ans=0.125 2023-06-17 23:15:38,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=53640.0, ans=0.125 2023-06-17 23:15:46,977 INFO [train.py:996] (0/4) Epoch 1, batch 8950, loss[loss=0.2721, simple_loss=0.3099, pruned_loss=0.1171, over 21292.00 frames. ], tot_loss[loss=0.369, simple_loss=0.4104, pruned_loss=0.1638, over 4268069.88 frames. ], batch size: 176, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 23:16:41,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=53820.0, ans=0.07 2023-06-17 23:16:56,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=53820.0, ans=0.125 2023-06-17 23:17:50,549 INFO [train.py:996] (0/4) Epoch 1, batch 9000, loss[loss=0.4251, simple_loss=0.506, pruned_loss=0.1721, over 19739.00 frames. ], tot_loss[loss=0.3632, simple_loss=0.4012, pruned_loss=0.1626, over 4265200.65 frames. ], batch size: 702, lr: 3.60e-02, grad_scale: 32.0 2023-06-17 23:17:50,550 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-17 23:18:41,565 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3222, simple_loss=0.4116, pruned_loss=0.1164, over 1796401.00 frames. 2023-06-17 23:18:41,566 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-17 23:19:15,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=54060.0, ans=0.1 2023-06-17 23:19:27,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.702e+02 3.716e+02 4.512e+02 5.914e+02 1.006e+03, threshold=9.023e+02, percent-clipped=2.0 2023-06-17 23:19:33,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=54120.0, ans=0.125 2023-06-17 23:20:25,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=54240.0, ans=0.0 2023-06-17 23:20:43,148 INFO [train.py:996] (0/4) Epoch 1, batch 9050, loss[loss=0.4424, simple_loss=0.4574, pruned_loss=0.2137, over 21443.00 frames. ], tot_loss[loss=0.358, simple_loss=0.3991, pruned_loss=0.1584, over 4268918.70 frames. ], batch size: 471, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 23:22:56,139 INFO [train.py:996] (0/4) Epoch 1, batch 9100, loss[loss=0.3541, simple_loss=0.4198, pruned_loss=0.1441, over 21642.00 frames. 
], tot_loss[loss=0.3653, simple_loss=0.4062, pruned_loss=0.1622, over 4274062.06 frames. ], batch size: 389, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 23:23:27,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=54660.0, ans=0.125 2023-06-17 23:23:29,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=54660.0, ans=0.125 2023-06-17 23:23:30,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.462e+02 4.314e+02 5.762e+02 1.601e+03, threshold=8.627e+02, percent-clipped=9.0 2023-06-17 23:23:58,305 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-17 23:24:18,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=54840.0, ans=0.2 2023-06-17 23:24:33,672 INFO [train.py:996] (0/4) Epoch 1, batch 9150, loss[loss=0.3451, simple_loss=0.4112, pruned_loss=0.1394, over 21738.00 frames. ], tot_loss[loss=0.3595, simple_loss=0.4052, pruned_loss=0.1569, over 4276937.40 frames. ], batch size: 351, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 23:24:48,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=54900.0, ans=0.0 2023-06-17 23:25:03,660 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-06-17 23:25:40,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=55020.0, ans=0.125 2023-06-17 23:26:15,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=55080.0, ans=0.0 2023-06-17 23:26:15,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=55080.0, ans=10.0 2023-06-17 23:26:45,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=55140.0, ans=0.0 2023-06-17 23:26:47,972 INFO [train.py:996] (0/4) Epoch 1, batch 9200, loss[loss=0.4764, simple_loss=0.4862, pruned_loss=0.2334, over 21731.00 frames. ], tot_loss[loss=0.3586, simple_loss=0.4053, pruned_loss=0.156, over 4272676.94 frames. ], batch size: 441, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 23:27:12,212 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.23 vs. limit=22.5 2023-06-17 23:27:24,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.410e+02 3.827e+02 4.968e+02 6.967e+02 1.252e+03, threshold=9.935e+02, percent-clipped=13.0 2023-06-17 23:28:45,007 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.34 vs. limit=15.0 2023-06-17 23:28:47,260 INFO [train.py:996] (0/4) Epoch 1, batch 9250, loss[loss=0.4271, simple_loss=0.432, pruned_loss=0.2111, over 21644.00 frames. ], tot_loss[loss=0.3689, simple_loss=0.4108, pruned_loss=0.1635, over 4274909.83 frames. ], batch size: 441, lr: 3.57e-02, grad_scale: 16.0 2023-06-17 23:29:08,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.00 vs. 
limit=6.0 2023-06-17 23:29:16,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-17 23:29:22,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=55620.0, ans=15.0 2023-06-17 23:29:30,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=55620.0, ans=0.125 2023-06-17 23:29:30,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=55620.0, ans=0.125 2023-06-17 23:30:26,451 INFO [train.py:996] (0/4) Epoch 1, batch 9300, loss[loss=0.3256, simple_loss=0.3681, pruned_loss=0.1416, over 21362.00 frames. ], tot_loss[loss=0.3661, simple_loss=0.4057, pruned_loss=0.1632, over 4270149.60 frames. ], batch size: 131, lr: 3.57e-02, grad_scale: 16.0 2023-06-17 23:30:28,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=55800.0, ans=0.2 2023-06-17 23:30:33,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=22.5 2023-06-17 23:31:10,769 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.819e+02 4.444e+02 5.472e+02 6.937e+02 1.249e+03, threshold=1.094e+03, percent-clipped=7.0 2023-06-17 23:32:07,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.21 vs. limit=6.0 2023-06-17 23:32:34,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=56100.0, ans=0.125 2023-06-17 23:32:35,226 INFO [train.py:996] (0/4) Epoch 1, batch 9350, loss[loss=0.3566, simple_loss=0.4046, pruned_loss=0.1543, over 21661.00 frames. ], tot_loss[loss=0.3692, simple_loss=0.4113, pruned_loss=0.1636, over 4260387.88 frames. ], batch size: 230, lr: 3.56e-02, grad_scale: 16.0 2023-06-17 23:32:43,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=56100.0, ans=0.1 2023-06-17 23:32:54,589 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-17 23:32:58,423 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:34:32,809 INFO [train.py:996] (0/4) Epoch 1, batch 9400, loss[loss=0.3117, simple_loss=0.3541, pruned_loss=0.1347, over 15696.00 frames. ], tot_loss[loss=0.3706, simple_loss=0.4127, pruned_loss=0.1643, over 4257724.27 frames. 
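A side note on the whiten_keys records (e.g. the encoders.2 one at 23:29:08 and the encoders.4 one at 23:32:07): num_channels=128 is exactly the configured 4 attention heads x query_head_dim=32 for these stacks, and num_groups=4 suggests the whitening statistic is computed per attention head. That reading is inferred from the dimensions alone, not confirmed from the source.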
], batch size: 60, lr: 3.55e-02, grad_scale: 16.0 2023-06-17 23:34:56,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=56400.0, ans=0.0 2023-06-17 23:34:56,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=56400.0, ans=0.125 2023-06-17 23:35:26,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.443e+02 3.466e+02 4.550e+02 6.027e+02 9.606e+02, threshold=9.099e+02, percent-clipped=0.0 2023-06-17 23:35:54,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=56580.0, ans=0.125 2023-06-17 23:36:22,979 INFO [train.py:996] (0/4) Epoch 1, batch 9450, loss[loss=0.3505, simple_loss=0.3826, pruned_loss=0.1592, over 21834.00 frames. ], tot_loss[loss=0.3625, simple_loss=0.4023, pruned_loss=0.1613, over 4268468.00 frames. ], batch size: 107, lr: 3.55e-02, grad_scale: 16.0 2023-06-17 23:37:21,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=56820.0, ans=0.0 2023-06-17 23:37:51,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=56940.0, ans=0.1 2023-06-17 23:37:51,452 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:38:00,025 INFO [train.py:996] (0/4) Epoch 1, batch 9500, loss[loss=0.3601, simple_loss=0.4014, pruned_loss=0.1594, over 21445.00 frames. ], tot_loss[loss=0.3534, simple_loss=0.3925, pruned_loss=0.1572, over 4264533.66 frames. ], batch size: 471, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 23:38:14,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.98 vs. limit=15.0 2023-06-17 23:38:50,367 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 3.365e+02 4.382e+02 5.694e+02 1.167e+03, threshold=8.764e+02, percent-clipped=2.0 2023-06-17 23:39:34,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=12.0 2023-06-17 23:40:02,010 INFO [train.py:996] (0/4) Epoch 1, batch 9550, loss[loss=0.4177, simple_loss=0.4353, pruned_loss=0.2001, over 21350.00 frames. ], tot_loss[loss=0.3609, simple_loss=0.3981, pruned_loss=0.1619, over 4272191.26 frames. ], batch size: 548, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 23:40:10,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=57300.0, ans=0.125 2023-06-17 23:41:32,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=57480.0, ans=0.0 2023-06-17 23:41:58,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=57600.0, ans=0.035 2023-06-17 23:41:59,562 INFO [train.py:996] (0/4) Epoch 1, batch 9600, loss[loss=0.4159, simple_loss=0.4947, pruned_loss=0.1685, over 20774.00 frames. ], tot_loss[loss=0.3639, simple_loss=0.4009, pruned_loss=0.1634, over 4278360.35 frames. 
], batch size: 607, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 23:42:08,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=57600.0, ans=0.0 2023-06-17 23:42:54,718 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.734e+02 4.517e+02 6.077e+02 1.128e+03, threshold=9.035e+02, percent-clipped=4.0 2023-06-17 23:43:09,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=57780.0, ans=0.0 2023-06-17 23:43:24,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=57780.0, ans=0.125 2023-06-17 23:43:35,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=57840.0, ans=0.2 2023-06-17 23:43:54,556 INFO [train.py:996] (0/4) Epoch 1, batch 9650, loss[loss=0.4037, simple_loss=0.4343, pruned_loss=0.1866, over 21694.00 frames. ], tot_loss[loss=0.3631, simple_loss=0.4003, pruned_loss=0.163, over 4282871.51 frames. ], batch size: 351, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 23:44:33,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=57960.0, ans=0.0 2023-06-17 23:44:33,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.58 vs. limit=15.0 2023-06-17 23:44:57,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=57960.0, ans=0.125 2023-06-17 23:44:59,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=57960.0, ans=0.0 2023-06-17 23:45:25,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=58080.0, ans=0.0 2023-06-17 23:45:29,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=58080.0, ans=0.0 2023-06-17 23:45:54,822 INFO [train.py:996] (0/4) Epoch 1, batch 9700, loss[loss=0.3412, simple_loss=0.3889, pruned_loss=0.1468, over 21416.00 frames. ], tot_loss[loss=0.3652, simple_loss=0.4037, pruned_loss=0.1633, over 4281050.68 frames. ], batch size: 548, lr: 3.52e-02, grad_scale: 32.0 2023-06-17 23:46:22,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=58260.0, ans=0.05 2023-06-17 23:46:40,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.561e+02 4.162e+02 5.499e+02 1.221e+03, threshold=8.324e+02, percent-clipped=3.0 2023-06-17 23:47:02,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=58320.0, ans=0.0 2023-06-17 23:47:16,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.46 vs. limit=6.0 2023-06-17 23:47:23,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=58440.0, ans=0.05 2023-06-17 23:47:54,057 INFO [train.py:996] (0/4) Epoch 1, batch 9750, loss[loss=0.3312, simple_loss=0.3639, pruned_loss=0.1493, over 21852.00 frames. ], tot_loss[loss=0.3586, simple_loss=0.3953, pruned_loss=0.161, over 4281319.15 frames. 
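grad_scale in these records is the fp16 dynamic loss scale (the run has use_fp16 enabled). It sits at 32.0, halves to 16.0 at the batch 9250 record — the signature of an overflow causing a skipped optimizer step — and is back at 32.0 by batch 9600; the same halve-and-recover dip recurs around batches 10250 and 10500 below. The recovery is faster than stock GradScaler's default growth_interval, so the recipe presumably nudges the scale back up on its own cadence. A minimal sketch of the standard torch.cuda.amp pattern (toy model and data, assumes a CUDA device like this run; not this recipe's exact loop):

    import torch

    device = "cuda"
    model = torch.nn.Linear(80, 4).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.045)
    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)  # matches the logged grad_scale

    for step in range(100):
        x = torch.randn(8, 80, device=device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(x).square().mean()
        scaler.scale(loss).backward()
        scaler.step(optimizer)   # skipped internally if grads contain inf/nan
        scaler.update()          # halves the scale on overflow, grows it back
                                 # after a run of overflow-free steps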
], batch size: 107, lr: 3.51e-02, grad_scale: 32.0 2023-06-17 23:48:01,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=58500.0, ans=0.125 2023-06-17 23:48:26,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=58560.0, ans=0.0 2023-06-17 23:48:34,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.09 vs. limit=10.0 2023-06-17 23:49:09,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=58680.0, ans=0.125 2023-06-17 23:49:25,512 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.36 vs. limit=15.0 2023-06-17 23:49:43,174 INFO [train.py:996] (0/4) Epoch 1, batch 9800, loss[loss=0.3767, simple_loss=0.3972, pruned_loss=0.1781, over 21796.00 frames. ], tot_loss[loss=0.3571, simple_loss=0.3941, pruned_loss=0.1601, over 4285578.76 frames. ], batch size: 441, lr: 3.51e-02, grad_scale: 32.0 2023-06-17 23:50:48,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 3.805e+02 4.466e+02 5.878e+02 9.239e+02, threshold=8.932e+02, percent-clipped=2.0 2023-06-17 23:51:36,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=59040.0, ans=0.125 2023-06-17 23:51:44,813 INFO [train.py:996] (0/4) Epoch 1, batch 9850, loss[loss=0.3472, simple_loss=0.3705, pruned_loss=0.162, over 21627.00 frames. ], tot_loss[loss=0.3554, simple_loss=0.3914, pruned_loss=0.1597, over 4286611.86 frames. ], batch size: 391, lr: 3.50e-02, grad_scale: 32.0 2023-06-17 23:52:01,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59100.0, ans=0.1 2023-06-17 23:52:10,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=59100.0, ans=0.125 2023-06-17 23:52:10,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=59100.0, ans=0.125 2023-06-17 23:53:42,085 INFO [train.py:996] (0/4) Epoch 1, batch 9900, loss[loss=0.4158, simple_loss=0.4458, pruned_loss=0.1929, over 21571.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3878, pruned_loss=0.1584, over 4260729.01 frames. ], batch size: 389, lr: 3.50e-02, grad_scale: 32.0 2023-06-17 23:54:37,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.479e+02 4.439e+02 5.469e+02 9.869e+02, threshold=8.878e+02, percent-clipped=4.0 2023-06-17 23:54:49,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=59520.0, ans=0.04949747468305833 2023-06-17 23:55:19,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59640.0, ans=0.1 2023-06-17 23:55:27,581 INFO [train.py:996] (0/4) Epoch 1, batch 9950, loss[loss=0.3425, simple_loss=0.3632, pruned_loss=0.1609, over 21546.00 frames. ], tot_loss[loss=0.3568, simple_loss=0.3906, pruned_loss=0.1615, over 4258977.40 frames. 
], batch size: 263, lr: 3.49e-02, grad_scale: 32.0 2023-06-17 23:56:17,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=59820.0, ans=0.125 2023-06-17 23:57:04,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=59940.0, ans=0.0 2023-06-17 23:57:04,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=59940.0, ans=0.125 2023-06-17 23:57:11,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=59940.0, ans=0.0 2023-06-17 23:57:26,288 INFO [train.py:996] (0/4) Epoch 1, batch 10000, loss[loss=0.419, simple_loss=0.4179, pruned_loss=0.2101, over 21275.00 frames. ], tot_loss[loss=0.3552, simple_loss=0.3886, pruned_loss=0.1609, over 4263810.15 frames. ], batch size: 471, lr: 3.49e-02, grad_scale: 32.0 2023-06-17 23:58:18,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=60060.0, ans=0.0 2023-06-17 23:58:22,343 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.612e+02 3.683e+02 4.306e+02 5.455e+02 9.094e+02, threshold=8.612e+02, percent-clipped=1.0 2023-06-17 23:58:22,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=60120.0, ans=0.125 2023-06-17 23:58:51,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=60180.0, ans=0.2 2023-06-17 23:59:40,484 INFO [train.py:996] (0/4) Epoch 1, batch 10050, loss[loss=0.317, simple_loss=0.3614, pruned_loss=0.1363, over 21266.00 frames. ], tot_loss[loss=0.3574, simple_loss=0.3909, pruned_loss=0.162, over 4269634.68 frames. ], batch size: 549, lr: 3.48e-02, grad_scale: 32.0 2023-06-18 00:00:19,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=60360.0, ans=0.2 2023-06-18 00:00:22,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=60360.0, ans=0.95 2023-06-18 00:00:52,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60420.0, ans=0.1 2023-06-18 00:01:17,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=60480.0, ans=0.2 2023-06-18 00:01:58,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=60540.0, ans=0.0 2023-06-18 00:02:00,514 INFO [train.py:996] (0/4) Epoch 1, batch 10100, loss[loss=0.3527, simple_loss=0.3989, pruned_loss=0.1533, over 21747.00 frames. ], tot_loss[loss=0.3511, simple_loss=0.3872, pruned_loss=0.1575, over 4266248.16 frames. 
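The lr column in these records is consistent with icefall's Eden schedule, assuming this recipe uses Eden with the configured base_lr=0.045 and lr_batches=7500: within the first epoch the epoch-dependent factor is still ~1.0, leaving lr ≈ base_lr * ((step^2 + lr_batches^2) / lr_batches^2)^(-0.25). A quick numeric check against the records (the formula is reconstructed, not copied from the scheduler source):

    def eden_lr(step, base_lr=0.045, lr_batches=7500.0):
        # Batch-dependent factor of the Eden schedule; the epoch factor is
        # omitted since it is ~1.0 this early in training.
        return base_lr * ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25

    assert round(eden_lr(10000), 4) == 0.0349  # logged lr: 3.49e-02 at batch 10000
    assert round(eden_lr(8400), 4) == 0.0367   # logged lr: 3.67e-02 at batch 8400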
], batch size: 351, lr: 3.47e-02, grad_scale: 32.0 2023-06-18 00:02:15,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=60600.0, ans=0.0 2023-06-18 00:02:45,036 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.836e+02 3.604e+02 4.387e+02 5.436e+02 8.331e+02, threshold=8.774e+02, percent-clipped=0.0 2023-06-18 00:02:52,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=60720.0, ans=0.125 2023-06-18 00:03:05,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=60720.0, ans=0.125 2023-06-18 00:03:43,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=60840.0, ans=0.0 2023-06-18 00:04:07,999 INFO [train.py:996] (0/4) Epoch 1, batch 10150, loss[loss=0.3562, simple_loss=0.4098, pruned_loss=0.1513, over 21361.00 frames. ], tot_loss[loss=0.359, simple_loss=0.3946, pruned_loss=0.1617, over 4264219.17 frames. ], batch size: 131, lr: 3.47e-02, grad_scale: 32.0 2023-06-18 00:05:01,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=61080.0, ans=0.2 2023-06-18 00:05:03,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=15.0 2023-06-18 00:05:14,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=61080.0, ans=0.2 2023-06-18 00:05:46,100 INFO [train.py:996] (0/4) Epoch 1, batch 10200, loss[loss=0.3609, simple_loss=0.4095, pruned_loss=0.1561, over 21534.00 frames. ], tot_loss[loss=0.3532, simple_loss=0.3917, pruned_loss=0.1573, over 4257276.69 frames. ], batch size: 441, lr: 3.46e-02, grad_scale: 32.0 2023-06-18 00:06:21,939 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 3.332e+02 4.219e+02 5.598e+02 1.332e+03, threshold=8.438e+02, percent-clipped=6.0 2023-06-18 00:07:23,917 INFO [train.py:996] (0/4) Epoch 1, batch 10250, loss[loss=0.4057, simple_loss=0.4455, pruned_loss=0.1829, over 21481.00 frames. ], tot_loss[loss=0.3432, simple_loss=0.3877, pruned_loss=0.1493, over 4260480.52 frames. ], batch size: 131, lr: 3.46e-02, grad_scale: 16.0 2023-06-18 00:07:50,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=61560.0, ans=0.125 2023-06-18 00:08:33,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=61680.0, ans=0.09899494936611666 2023-06-18 00:08:43,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61680.0, ans=0.1 2023-06-18 00:08:43,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-18 00:08:52,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=61740.0, ans=0.125 2023-06-18 00:09:09,878 INFO [train.py:996] (0/4) Epoch 1, batch 10300, loss[loss=0.3818, simple_loss=0.4168, pruned_loss=0.1734, over 20715.00 frames. ], tot_loss[loss=0.3443, simple_loss=0.389, pruned_loss=0.1498, over 4259469.41 frames. 
], batch size: 607, lr: 3.45e-02, grad_scale: 16.0 2023-06-18 00:09:19,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=61800.0, ans=0.0 2023-06-18 00:09:24,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=61800.0, ans=0.5 2023-06-18 00:09:50,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=61860.0, ans=0.0 2023-06-18 00:09:50,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=61860.0, ans=0.1 2023-06-18 00:09:57,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-18 00:10:24,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=61920.0, ans=0.2 2023-06-18 00:10:24,898 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 3.585e+02 4.803e+02 7.199e+02 1.796e+03, threshold=9.605e+02, percent-clipped=14.0 2023-06-18 00:10:57,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=61980.0, ans=0.125 2023-06-18 00:11:06,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=62040.0, ans=0.125 2023-06-18 00:11:12,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=62040.0, ans=0.125 2023-06-18 00:11:27,152 INFO [train.py:996] (0/4) Epoch 1, batch 10350, loss[loss=0.3999, simple_loss=0.4307, pruned_loss=0.1845, over 21475.00 frames. ], tot_loss[loss=0.3429, simple_loss=0.389, pruned_loss=0.1484, over 4260984.23 frames. ], batch size: 471, lr: 3.45e-02, grad_scale: 16.0 2023-06-18 00:11:32,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62100.0, ans=0.1 2023-06-18 00:11:59,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62160.0, ans=0.1 2023-06-18 00:12:15,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=62220.0, ans=0.0 2023-06-18 00:12:41,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=62280.0, ans=0.125 2023-06-18 00:12:42,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=62280.0, ans=0.2 2023-06-18 00:13:08,618 INFO [train.py:996] (0/4) Epoch 1, batch 10400, loss[loss=0.3319, simple_loss=0.3854, pruned_loss=0.1392, over 21554.00 frames. ], tot_loss[loss=0.3323, simple_loss=0.3774, pruned_loss=0.1436, over 4252693.59 frames. 
], batch size: 441, lr: 3.44e-02, grad_scale: 32.0 2023-06-18 00:13:09,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62400.0, ans=0.1 2023-06-18 00:13:51,650 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.668e+02 4.396e+02 5.299e+02 1.049e+03, threshold=8.792e+02, percent-clipped=2.0 2023-06-18 00:13:52,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=62520.0, ans=0.0 2023-06-18 00:14:37,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62640.0, ans=0.1 2023-06-18 00:14:47,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.99 vs. limit=15.0 2023-06-18 00:14:48,066 INFO [train.py:996] (0/4) Epoch 1, batch 10450, loss[loss=0.3676, simple_loss=0.4083, pruned_loss=0.1634, over 21387.00 frames. ], tot_loss[loss=0.3393, simple_loss=0.3819, pruned_loss=0.1483, over 4252329.29 frames. ], batch size: 131, lr: 3.44e-02, grad_scale: 32.0 2023-06-18 00:16:00,862 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:16:11,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.09 vs. limit=6.0 2023-06-18 00:16:16,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=62880.0, ans=0.025 2023-06-18 00:16:59,889 INFO [train.py:996] (0/4) Epoch 1, batch 10500, loss[loss=0.3551, simple_loss=0.3826, pruned_loss=0.1638, over 21463.00 frames. ], tot_loss[loss=0.342, simple_loss=0.3853, pruned_loss=0.1493, over 4257237.57 frames. ], batch size: 441, lr: 3.43e-02, grad_scale: 16.0 2023-06-18 00:17:54,111 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.435e+02 3.581e+02 4.532e+02 5.617e+02 1.542e+03, threshold=9.064e+02, percent-clipped=4.0 2023-06-18 00:18:20,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.65 vs. limit=22.5 2023-06-18 00:18:42,286 INFO [train.py:996] (0/4) Epoch 1, batch 10550, loss[loss=0.3372, simple_loss=0.3664, pruned_loss=0.154, over 21743.00 frames. ], tot_loss[loss=0.3411, simple_loss=0.3811, pruned_loss=0.1505, over 4257344.55 frames. ], batch size: 351, lr: 3.43e-02, grad_scale: 16.0 2023-06-18 00:18:51,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=63300.0, ans=0.125 2023-06-18 00:19:37,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=63420.0, ans=0.2 2023-06-18 00:20:02,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=63480.0, ans=0.125 2023-06-18 00:20:41,287 INFO [train.py:996] (0/4) Epoch 1, batch 10600, loss[loss=0.2655, simple_loss=0.3324, pruned_loss=0.09933, over 21248.00 frames. ], tot_loss[loss=0.3352, simple_loss=0.3749, pruned_loss=0.1477, over 4247762.14 frames. 
], batch size: 176, lr: 3.42e-02, grad_scale: 16.0 2023-06-18 00:20:53,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=63600.0, ans=0.0 2023-06-18 00:21:01,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=63600.0, ans=0.025 2023-06-18 00:21:01,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63600.0, ans=0.1 2023-06-18 00:21:32,041 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-18 00:21:42,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.306e+02 3.242e+02 3.917e+02 4.950e+02 7.459e+02, threshold=7.834e+02, percent-clipped=0.0 2023-06-18 00:21:56,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=63780.0, ans=0.0 2023-06-18 00:22:39,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=63840.0, ans=0.0 2023-06-18 00:23:04,592 INFO [train.py:996] (0/4) Epoch 1, batch 10650, loss[loss=0.3342, simple_loss=0.3897, pruned_loss=0.1393, over 21577.00 frames. ], tot_loss[loss=0.3352, simple_loss=0.3784, pruned_loss=0.146, over 4253328.65 frames. ], batch size: 441, lr: 3.41e-02, grad_scale: 16.0 2023-06-18 00:23:28,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=63960.0, ans=0.09899494936611666 2023-06-18 00:25:12,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0 2023-06-18 00:25:14,492 INFO [train.py:996] (0/4) Epoch 1, batch 10700, loss[loss=0.3583, simple_loss=0.378, pruned_loss=0.1693, over 21321.00 frames. ], tot_loss[loss=0.3347, simple_loss=0.3773, pruned_loss=0.146, over 4246120.33 frames. ], batch size: 471, lr: 3.41e-02, grad_scale: 16.0 2023-06-18 00:25:19,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=64200.0, ans=0.125 2023-06-18 00:25:22,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=64200.0, ans=0.09899494936611666 2023-06-18 00:25:26,238 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=15.0 2023-06-18 00:25:28,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=64260.0, ans=0.0 2023-06-18 00:25:53,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.366e+02 3.609e+02 4.403e+02 5.419e+02 8.654e+02, threshold=8.805e+02, percent-clipped=2.0 2023-06-18 00:25:55,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=64320.0, ans=0.125 2023-06-18 00:26:01,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=64320.0, ans=0.0 2023-06-18 00:27:08,923 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.18 vs. 
limit=15.0 2023-06-18 00:27:16,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-06-18 00:27:17,064 INFO [train.py:996] (0/4) Epoch 1, batch 10750, loss[loss=0.4927, simple_loss=0.523, pruned_loss=0.2312, over 21533.00 frames. ], tot_loss[loss=0.348, simple_loss=0.3897, pruned_loss=0.1531, over 4254040.81 frames. ], batch size: 471, lr: 3.40e-02, grad_scale: 16.0 2023-06-18 00:27:22,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=64500.0, ans=0.125 2023-06-18 00:27:35,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.44 vs. limit=6.0 2023-06-18 00:28:43,476 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.57 vs. limit=15.0 2023-06-18 00:29:06,562 INFO [train.py:996] (0/4) Epoch 1, batch 10800, loss[loss=0.3947, simple_loss=0.4231, pruned_loss=0.1831, over 21744.00 frames. ], tot_loss[loss=0.3506, simple_loss=0.3941, pruned_loss=0.1536, over 4256735.40 frames. ], batch size: 298, lr: 3.40e-02, grad_scale: 32.0 2023-06-18 00:29:33,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=64860.0, ans=0.125 2023-06-18 00:29:34,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-18 00:29:48,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=64860.0, ans=0.125 2023-06-18 00:30:01,343 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.483e+02 3.969e+02 4.979e+02 7.720e+02, threshold=7.938e+02, percent-clipped=0.0 2023-06-18 00:31:13,349 INFO [train.py:996] (0/4) Epoch 1, batch 10850, loss[loss=0.3689, simple_loss=0.3863, pruned_loss=0.1758, over 21285.00 frames. ], tot_loss[loss=0.3532, simple_loss=0.3962, pruned_loss=0.1551, over 4260583.55 frames. ], batch size: 471, lr: 3.39e-02, grad_scale: 32.0 2023-06-18 00:31:15,730 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0 2023-06-18 00:31:42,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=65160.0, ans=0.125 2023-06-18 00:32:16,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=65220.0, ans=0.2 2023-06-18 00:33:12,047 INFO [train.py:996] (0/4) Epoch 1, batch 10900, loss[loss=0.2981, simple_loss=0.3733, pruned_loss=0.1115, over 21409.00 frames. ], tot_loss[loss=0.3492, simple_loss=0.3911, pruned_loss=0.1536, over 4254552.47 frames. 
], batch size: 211, lr: 3.39e-02, grad_scale: 32.0 2023-06-18 00:33:51,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.422e+02 3.204e+02 4.085e+02 4.812e+02 7.317e+02, threshold=8.170e+02, percent-clipped=0.0 2023-06-18 00:34:49,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=65700.0, ans=0.125 2023-06-18 00:34:50,800 INFO [train.py:996] (0/4) Epoch 1, batch 10950, loss[loss=0.4028, simple_loss=0.4279, pruned_loss=0.1888, over 20640.00 frames. ], tot_loss[loss=0.3434, simple_loss=0.3858, pruned_loss=0.1505, over 4255784.93 frames. ], batch size: 607, lr: 3.38e-02, grad_scale: 32.0 2023-06-18 00:35:11,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=65760.0, ans=0.0 2023-06-18 00:36:17,679 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:36:45,679 INFO [train.py:996] (0/4) Epoch 1, batch 11000, loss[loss=0.3943, simple_loss=0.4316, pruned_loss=0.1785, over 21722.00 frames. ], tot_loss[loss=0.3426, simple_loss=0.3838, pruned_loss=0.1507, over 4258852.43 frames. ], batch size: 112, lr: 3.38e-02, grad_scale: 32.0 2023-06-18 00:37:35,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.527e+02 3.687e+02 4.482e+02 5.428e+02 9.093e+02, threshold=8.964e+02, percent-clipped=3.0 2023-06-18 00:38:49,265 INFO [train.py:996] (0/4) Epoch 1, batch 11050, loss[loss=0.3399, simple_loss=0.3709, pruned_loss=0.1544, over 21961.00 frames. ], tot_loss[loss=0.3419, simple_loss=0.3807, pruned_loss=0.1515, over 4269646.70 frames. ], batch size: 103, lr: 3.37e-02, grad_scale: 32.0 2023-06-18 00:38:51,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-18 00:39:55,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=66420.0, ans=0.035 2023-06-18 00:40:18,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66480.0, ans=0.1 2023-06-18 00:40:19,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-18 00:40:36,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=66540.0, ans=0.125 2023-06-18 00:40:44,263 INFO [train.py:996] (0/4) Epoch 1, batch 11100, loss[loss=0.3152, simple_loss=0.3518, pruned_loss=0.1393, over 21322.00 frames. ], tot_loss[loss=0.3384, simple_loss=0.3761, pruned_loss=0.1503, over 4272427.44 frames. 
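The frequent [scaling.py:182] ScheduledFloat entries record hyperparameters (dropout rates, skip rates, balancer probabilities, bypass scale minimums) whose reported value "ans" is a function of the global batch_count. A minimal sketch of a piecewise-linear schedule of this kind follows; the helper name and the breakpoints are assumptions for illustration, not the values used by this run.

    def scheduled_float(batch_count, points):
        # Piecewise-linear schedule over the global batch count: `points` is a
        # sorted list of (batch_count, value) breakpoints; the value is held
        # constant outside the range and interpolated linearly between
        # breakpoints.
        if batch_count <= points[0][0]:
            return points[0][1]
        if batch_count >= points[-1][0]:
            return points[-1][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    # Hypothetical breakpoints: a skip rate that decays to zero over the first
    # 50k batches; by batch_count=62100 it reports 0.0, like the ans=0.0
    # skip-rate entries in this part of the log.
    print(scheduled_float(62100.0, [(0.0, 0.5), (50000.0, 0.0)]))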
], batch size: 194, lr: 3.37e-02, grad_scale: 32.0 2023-06-18 00:40:59,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=66600.0, ans=0.0 2023-06-18 00:41:01,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=66600.0, ans=0.125 2023-06-18 00:41:17,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=66660.0, ans=0.035 2023-06-18 00:41:20,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=66660.0, ans=0.0 2023-06-18 00:41:27,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=66660.0, ans=0.125 2023-06-18 00:41:39,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.585e+02 3.202e+02 3.919e+02 4.833e+02 8.145e+02, threshold=7.838e+02, percent-clipped=0.0 2023-06-18 00:42:12,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=66780.0, ans=0.125 2023-06-18 00:42:39,303 INFO [train.py:996] (0/4) Epoch 1, batch 11150, loss[loss=0.3123, simple_loss=0.3461, pruned_loss=0.1393, over 21259.00 frames. ], tot_loss[loss=0.3363, simple_loss=0.3739, pruned_loss=0.1493, over 4277023.78 frames. ], batch size: 176, lr: 3.36e-02, grad_scale: 32.0 2023-06-18 00:42:55,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66960.0, ans=0.1 2023-06-18 00:42:56,391 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-18 00:42:57,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=66960.0, ans=0.0 2023-06-18 00:42:57,788 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0 2023-06-18 00:43:27,220 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-18 00:43:32,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-18 00:43:50,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=67080.0, ans=0.125 2023-06-18 00:43:51,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=67080.0, ans=0.125 2023-06-18 00:44:02,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67080.0, ans=0.1 2023-06-18 00:44:22,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67140.0, ans=0.1 2023-06-18 00:44:28,575 INFO [train.py:996] (0/4) Epoch 1, batch 11200, loss[loss=0.3235, simple_loss=0.3468, pruned_loss=0.1501, over 21627.00 frames. ], tot_loss[loss=0.3345, simple_loss=0.3715, pruned_loss=0.1487, over 4273008.63 frames. 
], batch size: 282, lr: 3.36e-02, grad_scale: 32.0 2023-06-18 00:44:33,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=67200.0, ans=0.125 2023-06-18 00:44:42,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67260.0, ans=0.1 2023-06-18 00:44:43,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.44 vs. limit=22.5 2023-06-18 00:45:07,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.477e+02 4.166e+02 5.140e+02 9.115e+02, threshold=8.331e+02, percent-clipped=3.0 2023-06-18 00:45:38,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=67380.0, ans=0.125 2023-06-18 00:46:00,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=67440.0, ans=0.0 2023-06-18 00:46:00,961 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.35 vs. limit=6.0 2023-06-18 00:46:03,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=67440.0, ans=0.125 2023-06-18 00:46:05,991 INFO [train.py:996] (0/4) Epoch 1, batch 11250, loss[loss=0.2987, simple_loss=0.3329, pruned_loss=0.1322, over 21730.00 frames. ], tot_loss[loss=0.3338, simple_loss=0.371, pruned_loss=0.1483, over 4270913.42 frames. ], batch size: 300, lr: 3.35e-02, grad_scale: 32.0 2023-06-18 00:47:20,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=67680.0, ans=0.2 2023-06-18 00:47:49,278 INFO [train.py:996] (0/4) Epoch 1, batch 11300, loss[loss=0.2758, simple_loss=0.3335, pruned_loss=0.1091, over 21264.00 frames. ], tot_loss[loss=0.3344, simple_loss=0.3719, pruned_loss=0.1484, over 4267472.31 frames. ], batch size: 143, lr: 3.35e-02, grad_scale: 32.0 2023-06-18 00:48:59,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=67920.0, ans=0.125 2023-06-18 00:49:00,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 3.206e+02 3.828e+02 4.812e+02 8.998e+02, threshold=7.656e+02, percent-clipped=1.0 2023-06-18 00:49:02,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=67920.0, ans=0.0 2023-06-18 00:49:45,810 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:50:09,173 INFO [train.py:996] (0/4) Epoch 1, batch 11350, loss[loss=0.4839, simple_loss=0.4846, pruned_loss=0.2416, over 21336.00 frames. ], tot_loss[loss=0.3348, simple_loss=0.3738, pruned_loss=0.1479, over 4271939.62 frames. ], batch size: 507, lr: 3.34e-02, grad_scale: 16.0 2023-06-18 00:50:52,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.09 vs. limit=10.0 2023-06-18 00:51:47,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.32 vs. 
limit=22.5 2023-06-18 00:52:07,603 INFO [train.py:996] (0/4) Epoch 1, batch 11400, loss[loss=0.3206, simple_loss=0.3762, pruned_loss=0.1325, over 21228.00 frames. ], tot_loss[loss=0.3437, simple_loss=0.3831, pruned_loss=0.1522, over 4275947.85 frames. ], batch size: 143, lr: 3.34e-02, grad_scale: 16.0 2023-06-18 00:52:14,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=68400.0, ans=0.125 2023-06-18 00:52:21,969 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-18 00:53:06,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=68460.0, ans=0.0 2023-06-18 00:53:16,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 3.896e+02 4.865e+02 6.013e+02 1.206e+03, threshold=9.731e+02, percent-clipped=12.0 2023-06-18 00:54:09,510 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:54:10,461 INFO [train.py:996] (0/4) Epoch 1, batch 11450, loss[loss=0.4002, simple_loss=0.4342, pruned_loss=0.183, over 21827.00 frames. ], tot_loss[loss=0.3447, simple_loss=0.386, pruned_loss=0.1517, over 4272281.16 frames. ], batch size: 124, lr: 3.33e-02, grad_scale: 16.0 2023-06-18 00:54:12,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=68700.0, ans=0.125 2023-06-18 00:54:25,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=68700.0, ans=0.0 2023-06-18 00:54:44,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=68760.0, ans=0.025 2023-06-18 00:54:48,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=68760.0, ans=0.125 2023-06-18 00:55:29,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=68820.0, ans=0.125 2023-06-18 00:55:29,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=68820.0, ans=0.0 2023-06-18 00:56:02,444 INFO [train.py:996] (0/4) Epoch 1, batch 11500, loss[loss=0.3104, simple_loss=0.3797, pruned_loss=0.1206, over 21898.00 frames. ], tot_loss[loss=0.3457, simple_loss=0.3882, pruned_loss=0.1516, over 4275373.95 frames. ], batch size: 316, lr: 3.33e-02, grad_scale: 16.0 2023-06-18 00:56:27,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=69060.0, ans=10.0 2023-06-18 00:57:20,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.435e+02 4.128e+02 5.314e+02 9.134e+02, threshold=8.255e+02, percent-clipped=0.0 2023-06-18 00:58:13,017 INFO [train.py:996] (0/4) Epoch 1, batch 11550, loss[loss=0.4406, simple_loss=0.5084, pruned_loss=0.1864, over 21852.00 frames. ], tot_loss[loss=0.3462, simple_loss=0.3922, pruned_loss=0.1501, over 4271978.97 frames. ], batch size: 371, lr: 3.32e-02, grad_scale: 16.0 2023-06-18 00:58:59,152 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. 
limit=15.0 2023-06-18 00:59:01,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=69420.0, ans=0.125 2023-06-18 01:00:05,182 INFO [train.py:996] (0/4) Epoch 1, batch 11600, loss[loss=0.3735, simple_loss=0.4676, pruned_loss=0.1397, over 21197.00 frames. ], tot_loss[loss=0.3545, simple_loss=0.406, pruned_loss=0.1515, over 4267521.43 frames. ], batch size: 548, lr: 3.32e-02, grad_scale: 32.0 2023-06-18 01:00:50,978 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.524e+02 3.816e+02 4.671e+02 6.267e+02 1.056e+03, threshold=9.343e+02, percent-clipped=9.0 2023-06-18 01:00:55,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=69720.0, ans=0.05 2023-06-18 01:01:57,959 INFO [train.py:996] (0/4) Epoch 1, batch 11650, loss[loss=0.3501, simple_loss=0.4266, pruned_loss=0.1368, over 21620.00 frames. ], tot_loss[loss=0.3543, simple_loss=0.4089, pruned_loss=0.1499, over 4271594.16 frames. ], batch size: 230, lr: 3.31e-02, grad_scale: 32.0 2023-06-18 01:03:37,389 INFO [train.py:996] (0/4) Epoch 1, batch 11700, loss[loss=0.3262, simple_loss=0.3562, pruned_loss=0.1481, over 21675.00 frames. ], tot_loss[loss=0.3522, simple_loss=0.4014, pruned_loss=0.1515, over 4270625.02 frames. ], batch size: 282, lr: 3.31e-02, grad_scale: 32.0 2023-06-18 01:03:49,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70200.0, ans=0.1 2023-06-18 01:04:22,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=70320.0, ans=0.2 2023-06-18 01:04:23,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.615e+02 4.235e+02 5.124e+02 6.443e+02 9.190e+02, threshold=1.025e+03, percent-clipped=0.0 2023-06-18 01:05:13,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=22.5 2023-06-18 01:05:13,829 INFO [train.py:996] (0/4) Epoch 1, batch 11750, loss[loss=0.364, simple_loss=0.3943, pruned_loss=0.1669, over 21636.00 frames. ], tot_loss[loss=0.347, simple_loss=0.3913, pruned_loss=0.1513, over 4269713.19 frames. ], batch size: 263, lr: 3.30e-02, grad_scale: 32.0 2023-06-18 01:05:14,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=70500.0, ans=0.125 2023-06-18 01:05:52,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=70560.0, ans=0.0 2023-06-18 01:07:07,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=70740.0, ans=0.125 2023-06-18 01:07:09,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=70740.0, ans=0.125 2023-06-18 01:07:10,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=70740.0, ans=0.125 2023-06-18 01:07:21,211 INFO [train.py:996] (0/4) Epoch 1, batch 11800, loss[loss=0.3217, simple_loss=0.3965, pruned_loss=0.1234, over 21575.00 frames. ], tot_loss[loss=0.3545, simple_loss=0.3958, pruned_loss=0.1566, over 4276520.16 frames. 
], batch size: 230, lr: 3.30e-02, grad_scale: 32.0 2023-06-18 01:07:31,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=70800.0, ans=0.0 2023-06-18 01:08:23,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 3.432e+02 4.026e+02 5.092e+02 8.722e+02, threshold=8.051e+02, percent-clipped=0.0 2023-06-18 01:08:27,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-18 01:08:53,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-18 01:09:32,059 INFO [train.py:996] (0/4) Epoch 1, batch 11850, loss[loss=0.3044, simple_loss=0.377, pruned_loss=0.1159, over 21749.00 frames. ], tot_loss[loss=0.3527, simple_loss=0.3958, pruned_loss=0.1548, over 4280240.42 frames. ], batch size: 298, lr: 3.29e-02, grad_scale: 32.0 2023-06-18 01:09:45,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=71100.0, ans=0.2 2023-06-18 01:11:21,901 INFO [train.py:996] (0/4) Epoch 1, batch 11900, loss[loss=0.2933, simple_loss=0.3601, pruned_loss=0.1132, over 21696.00 frames. ], tot_loss[loss=0.351, simple_loss=0.3975, pruned_loss=0.1523, over 4279279.18 frames. ], batch size: 247, lr: 3.29e-02, grad_scale: 16.0 2023-06-18 01:11:26,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=71400.0, ans=0.125 2023-06-18 01:12:15,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=71460.0, ans=0.0 2023-06-18 01:12:27,622 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.389e+02 4.230e+02 4.928e+02 7.939e+02, threshold=8.459e+02, percent-clipped=0.0 2023-06-18 01:13:15,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=71640.0, ans=0.125 2023-06-18 01:13:35,571 INFO [train.py:996] (0/4) Epoch 1, batch 11950, loss[loss=0.2745, simple_loss=0.3392, pruned_loss=0.1049, over 21205.00 frames. ], tot_loss[loss=0.3432, simple_loss=0.3951, pruned_loss=0.1456, over 4277147.48 frames. ], batch size: 176, lr: 3.28e-02, grad_scale: 16.0 2023-06-18 01:13:49,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=71760.0, ans=0.125 2023-06-18 01:15:16,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=71880.0, ans=0.125 2023-06-18 01:15:31,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71940.0, ans=0.1 2023-06-18 01:15:39,662 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-12000.pt 2023-06-18 01:15:43,907 INFO [train.py:996] (0/4) Epoch 1, batch 12000, loss[loss=0.2988, simple_loss=0.3304, pruned_loss=0.1336, over 21516.00 frames. ], tot_loss[loss=0.3368, simple_loss=0.3876, pruned_loss=0.143, over 4273755.81 frames. 
], batch size: 230, lr: 3.28e-02, grad_scale: 32.0 2023-06-18 01:15:43,909 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 01:16:39,560 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3214, simple_loss=0.4077, pruned_loss=0.1176, over 1796401.00 frames. 2023-06-18 01:16:39,561 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-18 01:17:03,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=72060.0, ans=0.125 2023-06-18 01:17:26,125 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 3.117e+02 3.794e+02 4.594e+02 6.987e+02, threshold=7.589e+02, percent-clipped=0.0 2023-06-18 01:17:41,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=72180.0, ans=0.125 2023-06-18 01:18:16,537 INFO [train.py:996] (0/4) Epoch 1, batch 12050, loss[loss=0.3841, simple_loss=0.4075, pruned_loss=0.1804, over 21867.00 frames. ], tot_loss[loss=0.3399, simple_loss=0.3866, pruned_loss=0.1466, over 4280368.09 frames. ], batch size: 371, lr: 3.27e-02, grad_scale: 32.0 2023-06-18 01:18:18,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-18 01:18:54,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=72360.0, ans=0.125 2023-06-18 01:19:15,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=72420.0, ans=0.0 2023-06-18 01:19:18,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=72420.0, ans=0.125 2023-06-18 01:20:03,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=72540.0, ans=0.125 2023-06-18 01:20:35,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=72540.0, ans=0.2 2023-06-18 01:20:39,881 INFO [train.py:996] (0/4) Epoch 1, batch 12100, loss[loss=0.3942, simple_loss=0.4183, pruned_loss=0.185, over 21384.00 frames. ], tot_loss[loss=0.3544, simple_loss=0.3981, pruned_loss=0.1553, over 4282348.78 frames. ], batch size: 548, lr: 3.27e-02, grad_scale: 32.0 2023-06-18 01:21:32,850 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.780e+02 3.918e+02 4.969e+02 6.272e+02 1.033e+03, threshold=9.938e+02, percent-clipped=11.0 2023-06-18 01:22:51,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=72900.0, ans=0.125 2023-06-18 01:22:52,400 INFO [train.py:996] (0/4) Epoch 1, batch 12150, loss[loss=0.3886, simple_loss=0.4126, pruned_loss=0.1823, over 21805.00 frames. ], tot_loss[loss=0.3566, simple_loss=0.4018, pruned_loss=0.1557, over 4280157.93 frames. 
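Around batch 12000 the log shows a checkpoint being written to zipformer/exp_L_small/checkpoint-12000.pt, immediately followed by a validation pass and a report of peak GPU memory. A minimal sketch of that periodic save-and-validate cadence is given below, assuming a fixed save interval; the function name, signature, and interval are hypothetical and this is not icefall's actual checkpoint API.

    import torch

    def on_batch_end(model, batch_idx, save_every_n, exp_dir, compute_validation_loss):
        # Hypothetical cadence: every `save_every_n` batches, persist the model
        # and run one pass over the dev set, mirroring the checkpoint-12000.pt
        # save and the validation report above.
        if batch_idx > 0 and batch_idx % save_every_n == 0:
            torch.save(model.state_dict(), f"{exp_dir}/checkpoint-{batch_idx}.pt")
            valid_loss = compute_validation_loss()
            print(f"validation: loss={valid_loss:.4f}")
            if torch.cuda.is_available():
                mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
                print(f"Maximum memory allocated so far is {mb}MB")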
], batch size: 124, lr: 3.26e-02, grad_scale: 32.0 2023-06-18 01:23:24,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=72900.0, ans=0.125 2023-06-18 01:23:50,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=72960.0, ans=0.0 2023-06-18 01:24:12,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73080.0, ans=0.1 2023-06-18 01:24:32,363 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-18 01:25:22,165 INFO [train.py:996] (0/4) Epoch 1, batch 12200, loss[loss=0.2986, simple_loss=0.3314, pruned_loss=0.1329, over 21588.00 frames. ], tot_loss[loss=0.3528, simple_loss=0.3969, pruned_loss=0.1544, over 4275317.64 frames. ], batch size: 247, lr: 3.26e-02, grad_scale: 32.0 2023-06-18 01:25:39,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=73200.0, ans=0.0 2023-06-18 01:26:10,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.716e+02 3.638e+02 4.548e+02 5.792e+02 8.635e+02, threshold=9.096e+02, percent-clipped=0.0 2023-06-18 01:26:34,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=73380.0, ans=0.05 2023-06-18 01:26:41,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=73380.0, ans=0.125 2023-06-18 01:27:17,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=15.0 2023-06-18 01:27:17,666 INFO [train.py:996] (0/4) Epoch 1, batch 12250, loss[loss=0.2457, simple_loss=0.3147, pruned_loss=0.0883, over 21650.00 frames. ], tot_loss[loss=0.34, simple_loss=0.3854, pruned_loss=0.1473, over 4270384.57 frames. ], batch size: 247, lr: 3.25e-02, grad_scale: 32.0 2023-06-18 01:27:34,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=73500.0, ans=0.125 2023-06-18 01:28:34,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=73740.0, ans=0.125 2023-06-18 01:29:07,456 INFO [train.py:996] (0/4) Epoch 1, batch 12300, loss[loss=0.3182, simple_loss=0.3848, pruned_loss=0.1258, over 21746.00 frames. ], tot_loss[loss=0.3221, simple_loss=0.3724, pruned_loss=0.1359, over 4273433.06 frames. ], batch size: 332, lr: 3.25e-02, grad_scale: 32.0 2023-06-18 01:29:33,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-18 01:30:09,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 3.199e+02 4.044e+02 5.045e+02 8.506e+02, threshold=8.089e+02, percent-clipped=0.0 2023-06-18 01:30:14,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=73920.0, ans=0.125 2023-06-18 01:31:15,762 INFO [train.py:996] (0/4) Epoch 1, batch 12350, loss[loss=0.422, simple_loss=0.449, pruned_loss=0.1975, over 21732.00 frames. 
], tot_loss[loss=0.3239, simple_loss=0.3754, pruned_loss=0.1361, over 4279059.34 frames. ], batch size: 441, lr: 3.24e-02, grad_scale: 32.0 2023-06-18 01:31:21,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=74100.0, ans=0.125 2023-06-18 01:32:23,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=74220.0, ans=0.125 2023-06-18 01:32:40,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=74280.0, ans=0.07 2023-06-18 01:32:46,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.81 vs. limit=6.0 2023-06-18 01:33:09,312 INFO [train.py:996] (0/4) Epoch 1, batch 12400, loss[loss=0.3539, simple_loss=0.3886, pruned_loss=0.1596, over 21401.00 frames. ], tot_loss[loss=0.3297, simple_loss=0.3777, pruned_loss=0.1409, over 4276520.48 frames. ], batch size: 144, lr: 3.24e-02, grad_scale: 32.0 2023-06-18 01:33:27,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=74400.0, ans=0.0 2023-06-18 01:34:07,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.455e+02 3.656e+02 4.405e+02 5.426e+02 8.475e+02, threshold=8.810e+02, percent-clipped=2.0 2023-06-18 01:34:45,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74640.0, ans=0.1 2023-06-18 01:35:22,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=74700.0, ans=0.2 2023-06-18 01:35:23,698 INFO [train.py:996] (0/4) Epoch 1, batch 12450, loss[loss=0.413, simple_loss=0.4457, pruned_loss=0.1901, over 21379.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.382, pruned_loss=0.1462, over 4276586.35 frames. ], batch size: 131, lr: 3.23e-02, grad_scale: 32.0 2023-06-18 01:36:32,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-18 01:36:48,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=74880.0, ans=0.0 2023-06-18 01:36:49,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=74880.0, ans=0.0 2023-06-18 01:37:43,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=75000.0, ans=0.2 2023-06-18 01:37:44,978 INFO [train.py:996] (0/4) Epoch 1, batch 12500, loss[loss=0.3606, simple_loss=0.4214, pruned_loss=0.1499, over 20797.00 frames. ], tot_loss[loss=0.3515, simple_loss=0.3963, pruned_loss=0.1534, over 4274220.52 frames. 
], batch size: 607, lr: 3.23e-02, grad_scale: 32.0 2023-06-18 01:37:56,038 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:37:57,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=75000.0, ans=0.0 2023-06-18 01:38:45,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=75120.0, ans=0.125 2023-06-18 01:38:52,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.484e+02 4.470e+02 5.562e+02 9.789e+02, threshold=8.941e+02, percent-clipped=2.0 2023-06-18 01:39:18,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=75180.0, ans=0.0 2023-06-18 01:39:33,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=75240.0, ans=0.1 2023-06-18 01:39:38,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=75240.0, ans=0.125 2023-06-18 01:40:06,055 INFO [train.py:996] (0/4) Epoch 1, batch 12550, loss[loss=0.3501, simple_loss=0.403, pruned_loss=0.1486, over 21975.00 frames. ], tot_loss[loss=0.3607, simple_loss=0.4052, pruned_loss=0.1581, over 4275316.26 frames. ], batch size: 317, lr: 3.22e-02, grad_scale: 16.0 2023-06-18 01:40:17,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=15.0 2023-06-18 01:42:06,766 INFO [train.py:996] (0/4) Epoch 1, batch 12600, loss[loss=0.3561, simple_loss=0.4392, pruned_loss=0.1365, over 21309.00 frames. ], tot_loss[loss=0.3548, simple_loss=0.4023, pruned_loss=0.1536, over 4272677.87 frames. ], batch size: 549, lr: 3.22e-02, grad_scale: 16.0 2023-06-18 01:42:07,206 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:42:07,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-18 01:42:12,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=75600.0, ans=0.1 2023-06-18 01:42:54,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=75660.0, ans=0.125 2023-06-18 01:43:24,483 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 3.361e+02 4.155e+02 4.807e+02 7.710e+02, threshold=8.311e+02, percent-clipped=0.0 2023-06-18 01:43:24,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=75720.0, ans=0.0 2023-06-18 01:43:56,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=75840.0, ans=0.035 2023-06-18 01:44:08,040 INFO [train.py:996] (0/4) Epoch 1, batch 12650, loss[loss=0.3331, simple_loss=0.3885, pruned_loss=0.1388, over 21306.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.392, pruned_loss=0.1458, over 4277022.27 frames. 
], batch size: 548, lr: 3.21e-02, grad_scale: 16.0 2023-06-18 01:44:34,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-18 01:45:29,847 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.22 vs. limit=10.0 2023-06-18 01:45:30,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=76140.0, ans=0.125 2023-06-18 01:45:45,870 INFO [train.py:996] (0/4) Epoch 1, batch 12700, loss[loss=0.3681, simple_loss=0.4027, pruned_loss=0.1668, over 21929.00 frames. ], tot_loss[loss=0.3473, simple_loss=0.3934, pruned_loss=0.1506, over 4281630.01 frames. ], batch size: 372, lr: 3.21e-02, grad_scale: 16.0 2023-06-18 01:46:05,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.91 vs. limit=22.5 2023-06-18 01:46:08,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76260.0, ans=0.1 2023-06-18 01:46:26,256 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:46:39,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 4.163e+02 4.705e+02 5.622e+02 1.035e+03, threshold=9.411e+02, percent-clipped=4.0 2023-06-18 01:46:40,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=76320.0, ans=0.125 2023-06-18 01:47:03,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76440.0, ans=0.1 2023-06-18 01:47:05,432 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:47:14,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=76440.0, ans=0.125 2023-06-18 01:47:16,607 INFO [train.py:996] (0/4) Epoch 1, batch 12750, loss[loss=0.2912, simple_loss=0.3548, pruned_loss=0.1138, over 21632.00 frames. ], tot_loss[loss=0.3485, simple_loss=0.3951, pruned_loss=0.1509, over 4282409.44 frames. ], batch size: 263, lr: 3.20e-02, grad_scale: 16.0 2023-06-18 01:48:41,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=76680.0, ans=0.025 2023-06-18 01:49:32,650 INFO [train.py:996] (0/4) Epoch 1, batch 12800, loss[loss=0.3899, simple_loss=0.4163, pruned_loss=0.1818, over 21846.00 frames. ], tot_loss[loss=0.3514, simple_loss=0.3958, pruned_loss=0.1535, over 4292429.99 frames. 
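The [scaling.py:962] Whitening entries compare a per-module statistic ("metric") against a target ("limit"): values near 1.0 indicate that a module's output channels are close to white (equal power, uncorrelated), while large values such as the metric=6.18 vs. limit=15.0 and metric=12.91 vs. limit=22.5 entries above indicate correlated activations. One plausible formulation of such a whiteness metric is sketched below; it is an illustration only and not claimed to be the exact scaling.py computation.

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        # For features x of shape (num_frames, num_channels), compare the
        # channel covariance to a scaled identity. By Cauchy-Schwarz this
        # ratio is >= 1.0, with equality exactly when all covariance
        # eigenvalues are equal, i.e. the features are perfectly white.
        x = x - x.mean(dim=0)
        cov = (x.t() @ x) / x.shape[0]        # (C, C) channel covariance
        c = cov.shape[0]
        mean_diag = cov.diagonal().mean()     # average channel power
        return (cov ** 2).sum() / (c * mean_diag ** 2 + 1e-20)

    white = torch.randn(10000, 256)
    print(float(whitening_metric(white)))         # close to 1.0
    mixed = white @ torch.randn(256, 256)         # correlate the channels
    print(float(whitening_metric(mixed)))         # well above 1.0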
], batch size: 371, lr: 3.20e-02, grad_scale: 32.0 2023-06-18 01:50:05,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=76860.0, ans=0.125 2023-06-18 01:50:21,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=76860.0, ans=0.125 2023-06-18 01:50:22,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=76920.0, ans=0.125 2023-06-18 01:50:25,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=76920.0, ans=0.2 2023-06-18 01:50:29,219 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.80 vs. limit=15.0 2023-06-18 01:50:33,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.808e+02 4.446e+02 5.808e+02 1.607e+03, threshold=8.892e+02, percent-clipped=7.0 2023-06-18 01:51:06,034 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:51:14,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=76980.0, ans=0.05 2023-06-18 01:51:35,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=77040.0, ans=0.125 2023-06-18 01:51:35,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=77040.0, ans=0.0 2023-06-18 01:51:59,023 INFO [train.py:996] (0/4) Epoch 1, batch 12850, loss[loss=0.4139, simple_loss=0.4664, pruned_loss=0.1807, over 19823.00 frames. ], tot_loss[loss=0.358, simple_loss=0.4019, pruned_loss=0.1571, over 4282539.52 frames. ], batch size: 704, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 01:52:10,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=77100.0, ans=0.125 2023-06-18 01:52:19,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=77100.0, ans=0.125 2023-06-18 01:52:38,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=77160.0, ans=0.125 2023-06-18 01:53:00,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=77280.0, ans=0.125 2023-06-18 01:53:03,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=77280.0, ans=0.0 2023-06-18 01:53:22,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=22.5 2023-06-18 01:53:28,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=77280.0, ans=0.0 2023-06-18 01:54:12,623 INFO [train.py:996] (0/4) Epoch 1, batch 12900, loss[loss=0.238, simple_loss=0.3018, pruned_loss=0.08709, over 21785.00 frames. ], tot_loss[loss=0.3504, simple_loss=0.398, pruned_loss=0.1514, over 4282987.15 frames. 
], batch size: 118, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 01:54:50,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.899e+02 3.578e+02 4.051e+02 7.858e+02, threshold=7.156e+02, percent-clipped=0.0 2023-06-18 01:54:57,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=77580.0, ans=0.125 2023-06-18 01:55:01,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=77580.0, ans=0.125 2023-06-18 01:55:04,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=77580.0, ans=0.125 2023-06-18 01:55:06,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77580.0, ans=0.1 2023-06-18 01:55:57,447 INFO [train.py:996] (0/4) Epoch 1, batch 12950, loss[loss=0.3566, simple_loss=0.4034, pruned_loss=0.1549, over 21326.00 frames. ], tot_loss[loss=0.3434, simple_loss=0.3933, pruned_loss=0.1467, over 4280819.10 frames. ], batch size: 549, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 01:56:15,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=15.0 2023-06-18 01:56:56,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=77820.0, ans=0.125 2023-06-18 01:57:03,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=77820.0, ans=0.09899494936611666 2023-06-18 01:57:06,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77820.0, ans=0.1 2023-06-18 01:58:03,851 INFO [train.py:996] (0/4) Epoch 1, batch 13000, loss[loss=0.3321, simple_loss=0.3858, pruned_loss=0.1392, over 21631.00 frames. ], tot_loss[loss=0.3479, simple_loss=0.396, pruned_loss=0.1499, over 4277285.54 frames. ], batch size: 441, lr: 3.18e-02, grad_scale: 32.0 2023-06-18 01:58:20,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78060.0, ans=0.1 2023-06-18 01:58:25,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.03 vs. limit=6.0 2023-06-18 01:58:54,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.081e+02 4.107e+02 5.952e+02 9.573e+02, threshold=8.214e+02, percent-clipped=12.0 2023-06-18 01:59:20,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=78180.0, ans=0.0 2023-06-18 01:59:58,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=78240.0, ans=0.125 2023-06-18 02:00:08,201 INFO [train.py:996] (0/4) Epoch 1, batch 13050, loss[loss=0.2717, simple_loss=0.3299, pruned_loss=0.1067, over 17292.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3879, pruned_loss=0.1433, over 4276719.59 frames. 
], batch size: 60, lr: 3.18e-02, grad_scale: 32.0 2023-06-18 02:01:09,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=78480.0, ans=0.2 2023-06-18 02:01:36,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.10 vs. limit=15.0 2023-06-18 02:01:56,469 INFO [train.py:996] (0/4) Epoch 1, batch 13100, loss[loss=0.3306, simple_loss=0.3786, pruned_loss=0.1413, over 21485.00 frames. ], tot_loss[loss=0.3379, simple_loss=0.3893, pruned_loss=0.1433, over 4275801.45 frames. ], batch size: 194, lr: 3.17e-02, grad_scale: 32.0 2023-06-18 02:01:58,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=78600.0, ans=0.0 2023-06-18 02:02:30,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=78660.0, ans=0.1 2023-06-18 02:03:16,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.576e+02 3.573e+02 4.115e+02 5.192e+02 9.461e+02, threshold=8.229e+02, percent-clipped=5.0 2023-06-18 02:03:55,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=78840.0, ans=0.0 2023-06-18 02:04:05,261 INFO [train.py:996] (0/4) Epoch 1, batch 13150, loss[loss=0.3066, simple_loss=0.304, pruned_loss=0.1546, over 20164.00 frames. ], tot_loss[loss=0.3445, simple_loss=0.3921, pruned_loss=0.1485, over 4274334.70 frames. ], batch size: 710, lr: 3.17e-02, grad_scale: 32.0 2023-06-18 02:04:30,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=78900.0, ans=0.2 2023-06-18 02:05:02,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=78960.0, ans=0.2 2023-06-18 02:05:04,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=78960.0, ans=0.0 2023-06-18 02:05:30,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79020.0, ans=0.1 2023-06-18 02:06:18,760 INFO [train.py:996] (0/4) Epoch 1, batch 13200, loss[loss=0.3565, simple_loss=0.4326, pruned_loss=0.1402, over 20833.00 frames. ], tot_loss[loss=0.3408, simple_loss=0.3881, pruned_loss=0.1468, over 4274826.67 frames. ], batch size: 608, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 02:06:19,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=79200.0, ans=0.125 2023-06-18 02:06:31,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=79200.0, ans=0.0 2023-06-18 02:06:59,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=79260.0, ans=0.2 2023-06-18 02:07:43,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 3.259e+02 3.924e+02 5.244e+02 8.674e+02, threshold=7.848e+02, percent-clipped=1.0 2023-06-18 02:08:22,178 INFO [train.py:996] (0/4) Epoch 1, batch 13250, loss[loss=0.3631, simple_loss=0.4208, pruned_loss=0.1527, over 21673.00 frames. ], tot_loss[loss=0.3438, simple_loss=0.3889, pruned_loss=0.1493, over 4282401.19 frames. 
], batch size: 414, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 02:08:29,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=79500.0, ans=0.2 2023-06-18 02:09:29,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=79560.0, ans=0.2 2023-06-18 02:09:46,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=79620.0, ans=0.0 2023-06-18 02:09:50,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=79680.0, ans=0.125 2023-06-18 02:09:56,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=79680.0, ans=0.125 2023-06-18 02:10:03,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=79680.0, ans=0.07 2023-06-18 02:10:34,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=79740.0, ans=0.125 2023-06-18 02:10:34,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79740.0, ans=0.1 2023-06-18 02:10:55,429 INFO [train.py:996] (0/4) Epoch 1, batch 13300, loss[loss=0.4227, simple_loss=0.4488, pruned_loss=0.1982, over 21737.00 frames. ], tot_loss[loss=0.3462, simple_loss=0.393, pruned_loss=0.1497, over 4278782.69 frames. ], batch size: 441, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 02:11:13,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=79800.0, ans=0.2 2023-06-18 02:11:49,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 3.863e+02 4.396e+02 5.432e+02 9.138e+02, threshold=8.792e+02, percent-clipped=4.0 2023-06-18 02:12:23,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=79980.0, ans=0.125 2023-06-18 02:12:57,189 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:13:08,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=80100.0, ans=0.125 2023-06-18 02:13:14,162 INFO [train.py:996] (0/4) Epoch 1, batch 13350, loss[loss=0.4995, simple_loss=0.4998, pruned_loss=0.2496, over 21398.00 frames. ], tot_loss[loss=0.353, simple_loss=0.3987, pruned_loss=0.1536, over 4279957.85 frames. ], batch size: 507, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 02:15:43,138 INFO [train.py:996] (0/4) Epoch 1, batch 13400, loss[loss=0.3739, simple_loss=0.4053, pruned_loss=0.1713, over 21736.00 frames. ], tot_loss[loss=0.3569, simple_loss=0.4008, pruned_loss=0.1565, over 4287632.97 frames. 
], batch size: 298, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 02:16:10,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=80460.0, ans=0.125 2023-06-18 02:16:33,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.555e+02 3.828e+02 4.349e+02 5.476e+02 1.150e+03, threshold=8.698e+02, percent-clipped=4.0 2023-06-18 02:17:44,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=80640.0, ans=0.125 2023-06-18 02:17:55,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.45 vs. limit=22.5 2023-06-18 02:17:55,764 INFO [train.py:996] (0/4) Epoch 1, batch 13450, loss[loss=0.3359, simple_loss=0.3756, pruned_loss=0.1481, over 21361.00 frames. ], tot_loss[loss=0.3613, simple_loss=0.4027, pruned_loss=0.1599, over 4288559.49 frames. ], batch size: 194, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 02:18:29,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=80820.0, ans=0.2 2023-06-18 02:18:55,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=80880.0, ans=0.125 2023-06-18 02:19:33,606 INFO [train.py:996] (0/4) Epoch 1, batch 13500, loss[loss=0.3545, simple_loss=0.3968, pruned_loss=0.1562, over 21693.00 frames. ], tot_loss[loss=0.3459, simple_loss=0.3872, pruned_loss=0.1522, over 4284053.22 frames. ], batch size: 391, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 02:19:55,291 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:19:58,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=81000.0, ans=0.0 2023-06-18 02:20:35,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=81120.0, ans=0.125 2023-06-18 02:20:49,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 3.441e+02 4.151e+02 5.577e+02 1.126e+03, threshold=8.302e+02, percent-clipped=5.0 2023-06-18 02:21:47,892 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-18 02:22:09,602 INFO [train.py:996] (0/4) Epoch 1, batch 13550, loss[loss=0.406, simple_loss=0.4716, pruned_loss=0.1702, over 21697.00 frames. ], tot_loss[loss=0.3466, simple_loss=0.3915, pruned_loss=0.1509, over 4280158.90 frames. 
], batch size: 389, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 02:22:22,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=81300.0, ans=0.125 2023-06-18 02:22:25,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81360.0, ans=0.1 2023-06-18 02:23:33,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=81480.0, ans=0.125 2023-06-18 02:24:02,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=81480.0, ans=0.125 2023-06-18 02:24:12,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.66 vs. limit=6.0 2023-06-18 02:24:18,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=81540.0, ans=0.025 2023-06-18 02:24:21,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=81540.0, ans=0.125 2023-06-18 02:24:22,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=81540.0, ans=0.125 2023-06-18 02:24:25,309 INFO [train.py:996] (0/4) Epoch 1, batch 13600, loss[loss=0.3413, simple_loss=0.3857, pruned_loss=0.1485, over 16376.00 frames. ], tot_loss[loss=0.3487, simple_loss=0.3942, pruned_loss=0.1516, over 4277781.36 frames. ], batch size: 61, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 02:25:25,445 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.475e+02 3.459e+02 4.202e+02 5.582e+02 9.308e+02, threshold=8.405e+02, percent-clipped=4.0 2023-06-18 02:25:32,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81720.0, ans=0.1 2023-06-18 02:26:25,969 INFO [train.py:996] (0/4) Epoch 1, batch 13650, loss[loss=0.2936, simple_loss=0.3419, pruned_loss=0.1227, over 21637.00 frames. ], tot_loss[loss=0.3416, simple_loss=0.3886, pruned_loss=0.1472, over 4270516.10 frames. ], batch size: 332, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 02:26:36,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81900.0, ans=0.1 2023-06-18 02:26:46,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=81960.0, ans=0.125 2023-06-18 02:26:54,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=81960.0, ans=0.0 2023-06-18 02:27:48,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=82080.0, ans=0.2 2023-06-18 02:28:20,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=82140.0, ans=0.125 2023-06-18 02:28:25,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82200.0, ans=0.1 2023-06-18 02:28:26,805 INFO [train.py:996] (0/4) Epoch 1, batch 13700, loss[loss=0.3442, simple_loss=0.386, pruned_loss=0.1512, over 21775.00 frames. 
], tot_loss[loss=0.3381, simple_loss=0.3827, pruned_loss=0.1467, over 4266259.94 frames. ], batch size: 332, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 02:28:44,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=82200.0, ans=0.125 2023-06-18 02:29:46,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.386e+02 3.308e+02 4.161e+02 5.184e+02 1.059e+03, threshold=8.322e+02, percent-clipped=1.0 2023-06-18 02:30:08,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=82380.0, ans=0.125 2023-06-18 02:30:12,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=82380.0, ans=0.0 2023-06-18 02:30:39,908 INFO [train.py:996] (0/4) Epoch 1, batch 13750, loss[loss=0.338, simple_loss=0.3985, pruned_loss=0.1388, over 21584.00 frames. ], tot_loss[loss=0.3327, simple_loss=0.3784, pruned_loss=0.1435, over 4265385.33 frames. ], batch size: 441, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 02:30:41,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=82500.0, ans=0.0 2023-06-18 02:31:47,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=82620.0, ans=0.0 2023-06-18 02:33:00,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=82740.0, ans=0.2 2023-06-18 02:33:17,470 INFO [train.py:996] (0/4) Epoch 1, batch 13800, loss[loss=0.2237, simple_loss=0.2638, pruned_loss=0.0918, over 16390.00 frames. ], tot_loss[loss=0.3346, simple_loss=0.3837, pruned_loss=0.1428, over 4261458.28 frames. ], batch size: 61, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 02:33:19,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=82800.0, ans=0.2 2023-06-18 02:33:47,461 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:34:11,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=82860.0, ans=0.125 2023-06-18 02:34:28,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.529e+02 4.559e+02 5.712e+02 9.830e+02, threshold=9.119e+02, percent-clipped=1.0 2023-06-18 02:34:33,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=82920.0, ans=0.2 2023-06-18 02:34:42,660 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-18 02:35:49,009 INFO [train.py:996] (0/4) Epoch 1, batch 13850, loss[loss=0.3718, simple_loss=0.4187, pruned_loss=0.1625, over 21787.00 frames. ], tot_loss[loss=0.3382, simple_loss=0.3891, pruned_loss=0.1437, over 4266722.85 frames. ], batch size: 282, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 02:35:52,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=83100.0, ans=0.02 2023-06-18 02:35:52,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.33 vs. 
limit=6.0 2023-06-18 02:37:16,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=83340.0, ans=0.0 2023-06-18 02:37:30,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=83340.0, ans=0.125 2023-06-18 02:37:55,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=83340.0, ans=0.125 2023-06-18 02:37:59,150 INFO [train.py:996] (0/4) Epoch 1, batch 13900, loss[loss=0.3274, simple_loss=0.368, pruned_loss=0.1434, over 21683.00 frames. ], tot_loss[loss=0.3481, simple_loss=0.3952, pruned_loss=0.1505, over 4267638.92 frames. ], batch size: 263, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 02:38:06,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=83400.0, ans=0.125 2023-06-18 02:38:46,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.741e+02 3.529e+02 4.116e+02 5.133e+02 1.052e+03, threshold=8.231e+02, percent-clipped=3.0 2023-06-18 02:39:03,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=83580.0, ans=0.125 2023-06-18 02:39:09,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=83640.0, ans=0.125 2023-06-18 02:39:43,492 INFO [train.py:996] (0/4) Epoch 1, batch 13950, loss[loss=0.3621, simple_loss=0.3986, pruned_loss=0.1627, over 21845.00 frames. ], tot_loss[loss=0.3529, simple_loss=0.3972, pruned_loss=0.1543, over 4277171.84 frames. ], batch size: 351, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 02:40:19,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=83700.0, ans=0.125 2023-06-18 02:40:31,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=83760.0, ans=0.2 2023-06-18 02:40:39,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=83820.0, ans=0.0 2023-06-18 02:40:44,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=83820.0, ans=0.0 2023-06-18 02:40:52,467 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:41:57,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=84000.0, ans=0.125 2023-06-18 02:41:58,059 INFO [train.py:996] (0/4) Epoch 1, batch 14000, loss[loss=0.3704, simple_loss=0.4731, pruned_loss=0.1338, over 20872.00 frames. ], tot_loss[loss=0.342, simple_loss=0.3882, pruned_loss=0.1479, over 4266603.93 frames. 
], batch size: 607, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 02:42:20,810 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:42:30,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84060.0, ans=0.1 2023-06-18 02:42:41,326 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 3.528e+02 4.428e+02 5.328e+02 9.334e+02, threshold=8.857e+02, percent-clipped=2.0 2023-06-18 02:42:53,690 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.25 vs. limit=15.0 2023-06-18 02:43:49,529 INFO [train.py:996] (0/4) Epoch 1, batch 14050, loss[loss=0.2961, simple_loss=0.3391, pruned_loss=0.1265, over 21942.00 frames. ], tot_loss[loss=0.3344, simple_loss=0.3836, pruned_loss=0.1426, over 4265194.94 frames. ], batch size: 113, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 02:43:51,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=84300.0, ans=0.125 2023-06-18 02:44:31,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=22.5 2023-06-18 02:44:33,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84420.0, ans=0.1 2023-06-18 02:45:08,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=84480.0, ans=0.2 2023-06-18 02:45:44,430 INFO [train.py:996] (0/4) Epoch 1, batch 14100, loss[loss=0.332, simple_loss=0.3681, pruned_loss=0.148, over 21240.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.3777, pruned_loss=0.1422, over 4248820.19 frames. ], batch size: 176, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 02:45:59,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.98 vs. limit=15.0 2023-06-18 02:46:02,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=84600.0, ans=0.2 2023-06-18 02:46:05,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=84660.0, ans=0.125 2023-06-18 02:46:47,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.207e+02 3.554e+02 4.205e+02 5.494e+02 8.066e+02, threshold=8.411e+02, percent-clipped=0.0 2023-06-18 02:47:39,390 INFO [train.py:996] (0/4) Epoch 1, batch 14150, loss[loss=0.2732, simple_loss=0.3456, pruned_loss=0.1004, over 21772.00 frames. ], tot_loss[loss=0.3316, simple_loss=0.3789, pruned_loss=0.1422, over 4254110.88 frames. 
], batch size: 102, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 02:47:48,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84900.0, ans=0.1 2023-06-18 02:47:49,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=84900.0, ans=0.125 2023-06-18 02:49:02,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=85080.0, ans=0.0 2023-06-18 02:49:05,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85140.0, ans=0.1 2023-06-18 02:49:32,056 INFO [train.py:996] (0/4) Epoch 1, batch 14200, loss[loss=0.3027, simple_loss=0.3414, pruned_loss=0.132, over 21784.00 frames. ], tot_loss[loss=0.3264, simple_loss=0.3753, pruned_loss=0.1387, over 4263293.40 frames. ], batch size: 98, lr: 3.08e-02, grad_scale: 16.0 2023-06-18 02:49:43,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85200.0, ans=0.1 2023-06-18 02:50:24,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=85320.0, ans=0.125 2023-06-18 02:50:25,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=85320.0, ans=0.0 2023-06-18 02:50:28,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 3.158e+02 3.617e+02 4.610e+02 1.061e+03, threshold=7.235e+02, percent-clipped=3.0 2023-06-18 02:51:23,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=85440.0, ans=0.125 2023-06-18 02:51:31,282 INFO [train.py:996] (0/4) Epoch 1, batch 14250, loss[loss=0.3415, simple_loss=0.3767, pruned_loss=0.1532, over 21964.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3705, pruned_loss=0.1381, over 4258685.36 frames. ], batch size: 103, lr: 3.07e-02, grad_scale: 16.0 2023-06-18 02:51:40,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=85500.0, ans=0.125 2023-06-18 02:51:41,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=85500.0, ans=0.125 2023-06-18 02:51:45,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=85500.0, ans=0.0 2023-06-18 02:51:56,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=85560.0, ans=0.125 2023-06-18 02:52:11,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-18 02:52:21,137 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.25 vs. limit=12.0 2023-06-18 02:52:23,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=85620.0, ans=0.0 2023-06-18 02:53:09,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.39 vs. 
limit=22.5 2023-06-18 02:53:19,409 INFO [train.py:996] (0/4) Epoch 1, batch 14300, loss[loss=0.3676, simple_loss=0.4284, pruned_loss=0.1534, over 21424.00 frames. ], tot_loss[loss=0.3261, simple_loss=0.3738, pruned_loss=0.1392, over 4262622.36 frames. ], batch size: 211, lr: 3.07e-02, grad_scale: 16.0 2023-06-18 02:54:04,054 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:54:23,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 3.210e+02 4.117e+02 5.928e+02 1.578e+03, threshold=8.234e+02, percent-clipped=19.0 2023-06-18 02:55:30,418 INFO [train.py:996] (0/4) Epoch 1, batch 14350, loss[loss=0.171, simple_loss=0.1895, pruned_loss=0.07629, over 17155.00 frames. ], tot_loss[loss=0.328, simple_loss=0.3774, pruned_loss=0.1393, over 4267268.53 frames. ], batch size: 61, lr: 3.06e-02, grad_scale: 16.0 2023-06-18 02:56:47,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=86340.0, ans=0.125 2023-06-18 02:56:47,083 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:57:12,505 INFO [train.py:996] (0/4) Epoch 1, batch 14400, loss[loss=0.3008, simple_loss=0.3384, pruned_loss=0.1316, over 21433.00 frames. ], tot_loss[loss=0.3288, simple_loss=0.3757, pruned_loss=0.1409, over 4275318.11 frames. ], batch size: 211, lr: 3.06e-02, grad_scale: 32.0 2023-06-18 02:57:56,144 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.630e+02 4.233e+02 4.966e+02 8.953e+02, threshold=8.465e+02, percent-clipped=1.0 2023-06-18 02:57:57,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.70 vs. limit=15.0 2023-06-18 02:58:31,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-18 02:59:04,518 INFO [train.py:996] (0/4) Epoch 1, batch 14450, loss[loss=0.3322, simple_loss=0.364, pruned_loss=0.1502, over 21748.00 frames. ], tot_loss[loss=0.327, simple_loss=0.3715, pruned_loss=0.1412, over 4269174.37 frames. ], batch size: 333, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 02:59:11,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=86700.0, ans=0.0 2023-06-18 02:59:36,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=86760.0, ans=0.0 2023-06-18 03:00:06,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=86880.0, ans=0.1 2023-06-18 03:00:34,012 INFO [train.py:996] (0/4) Epoch 1, batch 14500, loss[loss=0.3099, simple_loss=0.3453, pruned_loss=0.1373, over 21244.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3697, pruned_loss=0.1409, over 4266763.10 frames. 
], batch size: 159, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 03:00:45,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=87000.0, ans=0.025 2023-06-18 03:01:23,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 3.289e+02 4.039e+02 4.928e+02 1.223e+03, threshold=8.079e+02, percent-clipped=4.0 2023-06-18 03:02:37,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=87240.0, ans=0.0 2023-06-18 03:02:40,189 INFO [train.py:996] (0/4) Epoch 1, batch 14550, loss[loss=0.3303, simple_loss=0.38, pruned_loss=0.1403, over 21350.00 frames. ], tot_loss[loss=0.3333, simple_loss=0.3777, pruned_loss=0.1444, over 4271645.55 frames. ], batch size: 159, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 03:02:54,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=87360.0, ans=0.125 2023-06-18 03:02:56,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=87360.0, ans=0.95 2023-06-18 03:03:50,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=87480.0, ans=0.0 2023-06-18 03:03:59,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=15.0 2023-06-18 03:04:01,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.37 vs. limit=22.5 2023-06-18 03:04:37,687 INFO [train.py:996] (0/4) Epoch 1, batch 14600, loss[loss=0.3277, simple_loss=0.3909, pruned_loss=0.1323, over 21323.00 frames. ], tot_loss[loss=0.3453, simple_loss=0.3881, pruned_loss=0.1513, over 4278994.06 frames. ], batch size: 176, lr: 3.04e-02, grad_scale: 32.0 2023-06-18 03:04:57,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=87600.0, ans=0.0 2023-06-18 03:05:37,680 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.259e+02 3.686e+02 4.597e+02 5.767e+02 8.268e+02, threshold=9.195e+02, percent-clipped=3.0 2023-06-18 03:05:58,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=87780.0, ans=0.1 2023-06-18 03:05:58,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=87780.0, ans=0.2 2023-06-18 03:06:31,183 INFO [train.py:996] (0/4) Epoch 1, batch 14650, loss[loss=0.2615, simple_loss=0.3347, pruned_loss=0.09418, over 21842.00 frames. ], tot_loss[loss=0.3448, simple_loss=0.3904, pruned_loss=0.1496, over 4282108.88 frames. ], batch size: 316, lr: 3.04e-02, grad_scale: 32.0 2023-06-18 03:06:45,697 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-18 03:06:57,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=87960.0, ans=0.125 2023-06-18 03:07:56,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.64 vs. 
limit=15.0 2023-06-18 03:08:05,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=88140.0, ans=0.125 2023-06-18 03:08:23,522 INFO [train.py:996] (0/4) Epoch 1, batch 14700, loss[loss=0.2763, simple_loss=0.3384, pruned_loss=0.1071, over 21808.00 frames. ], tot_loss[loss=0.3312, simple_loss=0.381, pruned_loss=0.1407, over 4281425.92 frames. ], batch size: 124, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 03:09:28,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-18 03:09:31,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-18 03:09:37,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.719e+02 3.037e+02 3.906e+02 4.878e+02 1.107e+03, threshold=7.811e+02, percent-clipped=1.0 2023-06-18 03:09:39,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=88320.0, ans=0.125 2023-06-18 03:09:50,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=88380.0, ans=0.07 2023-06-18 03:10:15,640 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-18 03:10:24,998 INFO [train.py:996] (0/4) Epoch 1, batch 14750, loss[loss=0.3205, simple_loss=0.3529, pruned_loss=0.144, over 16596.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.3876, pruned_loss=0.145, over 4279476.36 frames. ], batch size: 60, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 03:10:35,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88500.0, ans=0.1 2023-06-18 03:10:56,183 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:10:56,663 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-18 03:12:18,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=88680.0, ans=0.125 2023-06-18 03:12:47,663 INFO [train.py:996] (0/4) Epoch 1, batch 14800, loss[loss=0.3368, simple_loss=0.377, pruned_loss=0.1483, over 21508.00 frames. ], tot_loss[loss=0.3537, simple_loss=0.4011, pruned_loss=0.1531, over 4279187.37 frames. ], batch size: 230, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 03:12:51,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=88800.0, ans=0.07 2023-06-18 03:12:58,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=88800.0, ans=0.2 2023-06-18 03:13:02,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. 
limit=10.0 2023-06-18 03:13:46,434 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.441e+02 3.653e+02 4.322e+02 5.427e+02 8.202e+02, threshold=8.644e+02, percent-clipped=2.0 2023-06-18 03:13:52,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=88980.0, ans=0.2 2023-06-18 03:14:47,072 INFO [train.py:996] (0/4) Epoch 1, batch 14850, loss[loss=0.393, simple_loss=0.4387, pruned_loss=0.1737, over 21645.00 frames. ], tot_loss[loss=0.3483, simple_loss=0.3931, pruned_loss=0.1518, over 4277742.07 frames. ], batch size: 414, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 03:14:48,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2023-06-18 03:15:19,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89100.0, ans=0.1 2023-06-18 03:15:26,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=89160.0, ans=0.2 2023-06-18 03:15:26,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=89160.0, ans=0.125 2023-06-18 03:15:26,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=89160.0, ans=0.125 2023-06-18 03:15:50,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=89160.0, ans=0.125 2023-06-18 03:16:09,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=89220.0, ans=0.0 2023-06-18 03:16:16,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=89280.0, ans=0.125 2023-06-18 03:16:56,802 INFO [train.py:996] (0/4) Epoch 1, batch 14900, loss[loss=0.3907, simple_loss=0.4253, pruned_loss=0.1781, over 21707.00 frames. ], tot_loss[loss=0.3524, simple_loss=0.3973, pruned_loss=0.1538, over 4274646.51 frames. ], batch size: 351, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 03:17:05,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=89400.0, ans=0.125 2023-06-18 03:17:52,390 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 3.509e+02 4.418e+02 5.707e+02 9.536e+02, threshold=8.836e+02, percent-clipped=5.0 2023-06-18 03:18:45,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=89640.0, ans=0.125 2023-06-18 03:19:03,621 INFO [train.py:996] (0/4) Epoch 1, batch 14950, loss[loss=0.2967, simple_loss=0.3563, pruned_loss=0.1186, over 21202.00 frames. ], tot_loss[loss=0.3518, simple_loss=0.3984, pruned_loss=0.1526, over 4271364.45 frames. ], batch size: 143, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 03:19:16,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=89700.0, ans=0.125 2023-06-18 03:19:42,678 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.85 vs. 
limit=15.0 2023-06-18 03:20:55,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=89940.0, ans=0.0 2023-06-18 03:20:59,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=89940.0, ans=0.125 2023-06-18 03:21:07,188 INFO [train.py:996] (0/4) Epoch 1, batch 15000, loss[loss=0.3396, simple_loss=0.3737, pruned_loss=0.1527, over 21871.00 frames. ], tot_loss[loss=0.3558, simple_loss=0.4006, pruned_loss=0.1555, over 4272665.21 frames. ], batch size: 98, lr: 3.01e-02, grad_scale: 16.0 2023-06-18 03:21:07,189 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 03:21:55,986 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3047, simple_loss=0.3953, pruned_loss=0.107, over 1796401.00 frames. 2023-06-18 03:21:55,988 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-18 03:22:17,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=90060.0, ans=0.05 2023-06-18 03:22:20,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=90060.0, ans=0.125 2023-06-18 03:22:23,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=90060.0, ans=0.125 2023-06-18 03:22:25,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=90060.0, ans=0.125 2023-06-18 03:22:48,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.606e+02 3.680e+02 4.733e+02 5.350e+02 8.092e+02, threshold=9.466e+02, percent-clipped=0.0 2023-06-18 03:23:10,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=90180.0, ans=0.2 2023-06-18 03:23:12,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=90180.0, ans=0.2 2023-06-18 03:23:25,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=90240.0, ans=0.04949747468305833 2023-06-18 03:23:51,333 INFO [train.py:996] (0/4) Epoch 1, batch 15050, loss[loss=0.3666, simple_loss=0.4129, pruned_loss=0.1602, over 21656.00 frames. ], tot_loss[loss=0.3576, simple_loss=0.402, pruned_loss=0.1566, over 4260154.20 frames. ], batch size: 389, lr: 3.01e-02, grad_scale: 16.0 2023-06-18 03:25:30,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=15.0 2023-06-18 03:25:57,664 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-18 03:26:03,915 INFO [train.py:996] (0/4) Epoch 1, batch 15100, loss[loss=0.4483, simple_loss=0.4634, pruned_loss=0.2166, over 21463.00 frames. ], tot_loss[loss=0.3567, simple_loss=0.4026, pruned_loss=0.1554, over 4265722.28 frames. 
], batch size: 471, lr: 3.00e-02, grad_scale: 16.0 2023-06-18 03:26:34,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=90660.0, ans=0.0 2023-06-18 03:26:43,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=90720.0, ans=10.0 2023-06-18 03:26:54,267 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.766e+02 4.918e+02 6.265e+02 9.726e+02, threshold=9.837e+02, percent-clipped=1.0 2023-06-18 03:27:44,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=90840.0, ans=0.07 2023-06-18 03:27:51,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=90840.0, ans=0.0 2023-06-18 03:27:56,496 INFO [train.py:996] (0/4) Epoch 1, batch 15150, loss[loss=0.2952, simple_loss=0.3317, pruned_loss=0.1294, over 21586.00 frames. ], tot_loss[loss=0.3563, simple_loss=0.3994, pruned_loss=0.1566, over 4268655.68 frames. ], batch size: 231, lr: 3.00e-02, grad_scale: 16.0 2023-06-18 03:28:24,640 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.38 vs. limit=10.0 2023-06-18 03:29:39,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=91140.0, ans=0.125 2023-06-18 03:30:03,093 INFO [train.py:996] (0/4) Epoch 1, batch 15200, loss[loss=0.3105, simple_loss=0.3663, pruned_loss=0.1274, over 21597.00 frames. ], tot_loss[loss=0.3446, simple_loss=0.3877, pruned_loss=0.1507, over 4271914.66 frames. ], batch size: 263, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 03:30:03,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=91200.0, ans=0.125 2023-06-18 03:30:49,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.12 vs. limit=10.0 2023-06-18 03:30:57,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 3.451e+02 4.223e+02 5.136e+02 8.420e+02, threshold=8.446e+02, percent-clipped=0.0 2023-06-18 03:31:07,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=91320.0, ans=0.95 2023-06-18 03:31:15,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=91380.0, ans=0.2 2023-06-18 03:31:39,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=91440.0, ans=0.0 2023-06-18 03:31:59,049 INFO [train.py:996] (0/4) Epoch 1, batch 15250, loss[loss=0.3158, simple_loss=0.3509, pruned_loss=0.1403, over 21242.00 frames. ], tot_loss[loss=0.3408, simple_loss=0.3832, pruned_loss=0.1492, over 4269382.42 frames. ], batch size: 176, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 03:32:13,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91500.0, ans=0.1 2023-06-18 03:33:26,220 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.05 vs. 
limit=22.5 2023-06-18 03:33:42,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0 2023-06-18 03:34:02,104 INFO [train.py:996] (0/4) Epoch 1, batch 15300, loss[loss=0.3542, simple_loss=0.4046, pruned_loss=0.1519, over 21284.00 frames. ], tot_loss[loss=0.3468, simple_loss=0.3865, pruned_loss=0.1536, over 4266637.54 frames. ], batch size: 143, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 03:34:17,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=91800.0, ans=0.2 2023-06-18 03:34:21,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-06-18 03:34:43,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=91860.0, ans=0.125 2023-06-18 03:35:10,029 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.694e+02 4.588e+02 5.460e+02 1.157e+03, threshold=9.176e+02, percent-clipped=1.0 2023-06-18 03:36:16,124 INFO [train.py:996] (0/4) Epoch 1, batch 15350, loss[loss=0.2992, simple_loss=0.368, pruned_loss=0.1151, over 21444.00 frames. ], tot_loss[loss=0.352, simple_loss=0.3916, pruned_loss=0.1562, over 4269343.60 frames. ], batch size: 211, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 03:38:08,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=15.0 2023-06-18 03:38:13,657 INFO [train.py:996] (0/4) Epoch 1, batch 15400, loss[loss=0.2815, simple_loss=0.3591, pruned_loss=0.1019, over 21834.00 frames. ], tot_loss[loss=0.351, simple_loss=0.3927, pruned_loss=0.1546, over 4267224.91 frames. ], batch size: 102, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 03:38:26,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-18 03:38:34,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=92460.0, ans=0.125 2023-06-18 03:38:37,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.36 vs. limit=6.0 2023-06-18 03:38:46,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=92460.0, ans=0.125 2023-06-18 03:39:06,568 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.578e+02 4.472e+02 5.706e+02 1.204e+03, threshold=8.945e+02, percent-clipped=6.0 2023-06-18 03:39:36,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=92580.0, ans=0.0 2023-06-18 03:39:57,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=92640.0, ans=0.0 2023-06-18 03:40:04,062 INFO [train.py:996] (0/4) Epoch 1, batch 15450, loss[loss=0.3121, simple_loss=0.3715, pruned_loss=0.1264, over 21386.00 frames. ], tot_loss[loss=0.3478, simple_loss=0.3898, pruned_loss=0.1529, over 4261200.87 frames. 
], batch size: 211, lr: 2.97e-02, grad_scale: 32.0 2023-06-18 03:40:07,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=92700.0, ans=0.2 2023-06-18 03:40:10,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=92700.0, ans=0.0 2023-06-18 03:40:39,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=92820.0, ans=0.0 2023-06-18 03:41:14,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.35 vs. limit=10.0 2023-06-18 03:41:24,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=92940.0, ans=10.0 2023-06-18 03:41:29,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=92940.0, ans=0.125 2023-06-18 03:41:34,863 INFO [train.py:996] (0/4) Epoch 1, batch 15500, loss[loss=0.362, simple_loss=0.4045, pruned_loss=0.1598, over 21917.00 frames. ], tot_loss[loss=0.3479, simple_loss=0.392, pruned_loss=0.1519, over 4257828.69 frames. ], batch size: 316, lr: 2.97e-02, grad_scale: 16.0 2023-06-18 03:42:12,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=93120.0, ans=0.125 2023-06-18 03:42:38,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.182e+02 4.008e+02 5.609e+02 1.059e+03, threshold=8.016e+02, percent-clipped=3.0 2023-06-18 03:42:45,715 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-18 03:42:45,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.29 vs. limit=22.5 2023-06-18 03:43:41,837 INFO [train.py:996] (0/4) Epoch 1, batch 15550, loss[loss=0.2997, simple_loss=0.3624, pruned_loss=0.1185, over 21726.00 frames. ], tot_loss[loss=0.3442, simple_loss=0.3917, pruned_loss=0.1483, over 4260546.72 frames. ], batch size: 332, lr: 2.97e-02, grad_scale: 16.0 2023-06-18 03:43:55,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=93360.0, ans=0.125 2023-06-18 03:44:13,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=93360.0, ans=0.0 2023-06-18 03:45:25,077 INFO [train.py:996] (0/4) Epoch 1, batch 15600, loss[loss=0.3223, simple_loss=0.3576, pruned_loss=0.1435, over 21586.00 frames. ], tot_loss[loss=0.338, simple_loss=0.3851, pruned_loss=0.1455, over 4254404.45 frames. 
], batch size: 332, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 03:45:31,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=93600.0, ans=0.0 2023-06-18 03:46:19,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=93660.0, ans=0.0 2023-06-18 03:46:23,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93720.0, ans=0.1 2023-06-18 03:46:40,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 3.054e+02 4.153e+02 4.865e+02 7.806e+02, threshold=8.306e+02, percent-clipped=0.0 2023-06-18 03:47:35,558 INFO [train.py:996] (0/4) Epoch 1, batch 15650, loss[loss=0.3316, simple_loss=0.3684, pruned_loss=0.1474, over 21750.00 frames. ], tot_loss[loss=0.3345, simple_loss=0.3811, pruned_loss=0.1439, over 4264265.51 frames. ], batch size: 351, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 03:47:35,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=93900.0, ans=0.125 2023-06-18 03:47:45,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=93900.0, ans=0.0 2023-06-18 03:47:49,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93960.0, ans=0.1 2023-06-18 03:48:55,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94080.0, ans=0.1 2023-06-18 03:49:33,341 INFO [train.py:996] (0/4) Epoch 1, batch 15700, loss[loss=0.3004, simple_loss=0.3419, pruned_loss=0.1294, over 21839.00 frames. ], tot_loss[loss=0.3295, simple_loss=0.3754, pruned_loss=0.1418, over 4266954.06 frames. ], batch size: 98, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 03:49:58,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.92 vs. limit=15.0 2023-06-18 03:50:40,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=94320.0, ans=0.125 2023-06-18 03:50:43,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.09 vs. limit=15.0 2023-06-18 03:50:47,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.466e+02 4.238e+02 5.232e+02 9.594e+02, threshold=8.475e+02, percent-clipped=2.0 2023-06-18 03:51:16,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=94440.0, ans=0.0 2023-06-18 03:51:24,783 INFO [train.py:996] (0/4) Epoch 1, batch 15750, loss[loss=0.3196, simple_loss=0.3563, pruned_loss=0.1415, over 21778.00 frames. ], tot_loss[loss=0.3268, simple_loss=0.3706, pruned_loss=0.1415, over 4261754.52 frames. ], batch size: 112, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 03:51:29,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=94500.0, ans=0.125 2023-06-18 03:51:30,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.39 vs. 
limit=15.0 2023-06-18 03:51:34,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=94500.0, ans=0.125 2023-06-18 03:52:30,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=94680.0, ans=10.0 2023-06-18 03:53:09,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=94740.0, ans=0.0 2023-06-18 03:53:13,090 INFO [train.py:996] (0/4) Epoch 1, batch 15800, loss[loss=0.3108, simple_loss=0.3402, pruned_loss=0.1407, over 21449.00 frames. ], tot_loss[loss=0.3227, simple_loss=0.365, pruned_loss=0.1402, over 4259563.40 frames. ], batch size: 212, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 03:54:01,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-18 03:54:03,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=94860.0, ans=0.1 2023-06-18 03:54:24,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 3.502e+02 4.028e+02 5.074e+02 9.979e+02, threshold=8.057e+02, percent-clipped=5.0 2023-06-18 03:54:55,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=94980.0, ans=0.05 2023-06-18 03:54:57,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=95040.0, ans=0.0 2023-06-18 03:55:14,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=22.5 2023-06-18 03:55:22,918 INFO [train.py:996] (0/4) Epoch 1, batch 15850, loss[loss=0.2788, simple_loss=0.3153, pruned_loss=0.1211, over 21798.00 frames. ], tot_loss[loss=0.3257, simple_loss=0.366, pruned_loss=0.1427, over 4270472.59 frames. ], batch size: 107, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 03:55:28,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-18 03:55:29,709 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.75 vs. limit=22.5 2023-06-18 03:56:22,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95220.0, ans=0.1 2023-06-18 03:56:23,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. limit=6.0 2023-06-18 03:56:30,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=95220.0, ans=0.2 2023-06-18 03:56:33,599 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.472e-01 2023-06-18 03:56:52,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.25 vs. 
limit=22.5 2023-06-18 03:56:54,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=95280.0, ans=0.125 2023-06-18 03:57:41,513 INFO [train.py:996] (0/4) Epoch 1, batch 15900, loss[loss=0.3508, simple_loss=0.3891, pruned_loss=0.1562, over 21688.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3673, pruned_loss=0.1444, over 4268218.78 frames. ], batch size: 351, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 03:58:39,809 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.679e+02 3.625e+02 4.216e+02 5.103e+02 7.976e+02, threshold=8.431e+02, percent-clipped=0.0 2023-06-18 03:58:59,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=95640.0, ans=0.125 2023-06-18 03:59:18,976 INFO [train.py:996] (0/4) Epoch 1, batch 15950, loss[loss=0.2958, simple_loss=0.3761, pruned_loss=0.1077, over 21639.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3668, pruned_loss=0.1413, over 4258990.21 frames. ], batch size: 389, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 04:00:08,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=95820.0, ans=0.0 2023-06-18 04:00:29,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95880.0, ans=0.1 2023-06-18 04:00:35,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=95940.0, ans=0.125 2023-06-18 04:00:36,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=95940.0, ans=0.2 2023-06-18 04:00:49,126 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-16000.pt 2023-06-18 04:00:53,716 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:00:55,797 INFO [train.py:996] (0/4) Epoch 1, batch 16000, loss[loss=0.3108, simple_loss=0.3878, pruned_loss=0.1169, over 21869.00 frames. ], tot_loss[loss=0.321, simple_loss=0.367, pruned_loss=0.1375, over 4268001.03 frames. ], batch size: 371, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 04:01:49,640 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.964e+02 3.608e+02 4.425e+02 8.344e+02, threshold=7.217e+02, percent-clipped=0.0 2023-06-18 04:01:54,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=96180.0, ans=0.125 2023-06-18 04:02:22,048 INFO [train.py:996] (0/4) Epoch 1, batch 16050, loss[loss=0.3234, simple_loss=0.3924, pruned_loss=0.1272, over 21620.00 frames. ], tot_loss[loss=0.3181, simple_loss=0.3695, pruned_loss=0.1334, over 4268284.29 frames. 
], batch size: 263, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 04:02:49,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=96360.0, ans=0.0 2023-06-18 04:02:52,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=96360.0, ans=0.125 2023-06-18 04:03:55,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=96480.0, ans=10.0 2023-06-18 04:04:15,092 INFO [train.py:996] (0/4) Epoch 1, batch 16100, loss[loss=0.3331, simple_loss=0.3802, pruned_loss=0.143, over 21985.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3756, pruned_loss=0.1359, over 4274554.99 frames. ], batch size: 113, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 04:04:16,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=96600.0, ans=0.125 2023-06-18 04:04:16,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=96600.0, ans=0.125 2023-06-18 04:05:10,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=96720.0, ans=0.0 2023-06-18 04:05:25,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=96720.0, ans=0.125 2023-06-18 04:05:25,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.22 vs. limit=6.0 2023-06-18 04:05:38,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 3.663e+02 4.522e+02 5.171e+02 8.163e+02, threshold=9.044e+02, percent-clipped=3.0 2023-06-18 04:06:11,406 INFO [train.py:996] (0/4) Epoch 1, batch 16150, loss[loss=0.3505, simple_loss=0.3955, pruned_loss=0.1528, over 21906.00 frames. ], tot_loss[loss=0.3284, simple_loss=0.3782, pruned_loss=0.1393, over 4286507.71 frames. ], batch size: 371, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 04:07:45,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-18 04:07:51,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=22.5 2023-06-18 04:08:24,768 INFO [train.py:996] (0/4) Epoch 1, batch 16200, loss[loss=0.3493, simple_loss=0.4063, pruned_loss=0.1462, over 21720.00 frames. ], tot_loss[loss=0.3336, simple_loss=0.3837, pruned_loss=0.1418, over 4287413.59 frames. 
], batch size: 298, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 04:08:33,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=97200.0, ans=0.0 2023-06-18 04:09:40,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=97320.0, ans=0.0 2023-06-18 04:09:41,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 3.302e+02 3.926e+02 4.944e+02 8.800e+02, threshold=7.851e+02, percent-clipped=0.0 2023-06-18 04:09:44,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=97380.0, ans=0.0 2023-06-18 04:10:19,004 INFO [train.py:996] (0/4) Epoch 1, batch 16250, loss[loss=0.2318, simple_loss=0.293, pruned_loss=0.08536, over 21452.00 frames. ], tot_loss[loss=0.3334, simple_loss=0.3824, pruned_loss=0.1422, over 4285430.16 frames. ], batch size: 195, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 04:11:18,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=97620.0, ans=0.2 2023-06-18 04:11:33,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=97680.0, ans=0.0 2023-06-18 04:12:06,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=97740.0, ans=0.125 2023-06-18 04:12:18,164 INFO [train.py:996] (0/4) Epoch 1, batch 16300, loss[loss=0.3716, simple_loss=0.4032, pruned_loss=0.17, over 20062.00 frames. ], tot_loss[loss=0.3262, simple_loss=0.3766, pruned_loss=0.1379, over 4282049.75 frames. ], batch size: 702, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 04:12:56,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=97860.0, ans=0.125 2023-06-18 04:13:19,440 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 2.969e+02 3.585e+02 4.834e+02 8.506e+02, threshold=7.169e+02, percent-clipped=1.0 2023-06-18 04:14:26,819 INFO [train.py:996] (0/4) Epoch 1, batch 16350, loss[loss=0.4298, simple_loss=0.4534, pruned_loss=0.2031, over 21790.00 frames. ], tot_loss[loss=0.3303, simple_loss=0.3785, pruned_loss=0.1411, over 4277972.14 frames. ], batch size: 441, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 04:16:37,186 INFO [train.py:996] (0/4) Epoch 1, batch 16400, loss[loss=0.3758, simple_loss=0.4103, pruned_loss=0.1707, over 21859.00 frames. ], tot_loss[loss=0.3331, simple_loss=0.3818, pruned_loss=0.1422, over 4284590.23 frames. ], batch size: 414, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 04:16:50,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=98400.0, ans=0.125 2023-06-18 04:17:42,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=98520.0, ans=0.0 2023-06-18 04:17:44,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.149e+02 3.641e+02 4.935e+02 8.709e+02, threshold=7.281e+02, percent-clipped=2.0 2023-06-18 04:18:59,111 INFO [train.py:996] (0/4) Epoch 1, batch 16450, loss[loss=0.3498, simple_loss=0.386, pruned_loss=0.1568, over 21878.00 frames. ], tot_loss[loss=0.3343, simple_loss=0.3814, pruned_loss=0.1436, over 4289785.32 frames. 
], batch size: 351, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 04:19:01,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=98700.0, ans=0.0 2023-06-18 04:20:06,993 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. limit=10.0 2023-06-18 04:20:43,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=98940.0, ans=10.0 2023-06-18 04:20:43,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=98940.0, ans=0.2 2023-06-18 04:20:54,862 INFO [train.py:996] (0/4) Epoch 1, batch 16500, loss[loss=0.3463, simple_loss=0.38, pruned_loss=0.1563, over 21115.00 frames. ], tot_loss[loss=0.3306, simple_loss=0.3771, pruned_loss=0.1421, over 4289985.45 frames. ], batch size: 607, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 04:21:17,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.87 vs. limit=15.0 2023-06-18 04:21:40,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=99060.0, ans=0.1 2023-06-18 04:21:59,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=99120.0, ans=0.125 2023-06-18 04:22:00,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=99120.0, ans=0.2 2023-06-18 04:22:25,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.246e+02 4.122e+02 4.972e+02 8.514e+02, threshold=8.244e+02, percent-clipped=2.0 2023-06-18 04:22:37,325 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.00 vs. limit=15.0 2023-06-18 04:23:43,059 INFO [train.py:996] (0/4) Epoch 1, batch 16550, loss[loss=0.3519, simple_loss=0.4032, pruned_loss=0.1503, over 21720.00 frames. ], tot_loss[loss=0.3302, simple_loss=0.3782, pruned_loss=0.1411, over 4278500.56 frames. ], batch size: 351, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 04:24:02,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=99360.0, ans=0.125 2023-06-18 04:24:13,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=99420.0, ans=0.125 2023-06-18 04:25:23,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.26 vs. limit=22.5 2023-06-18 04:25:39,590 INFO [train.py:996] (0/4) Epoch 1, batch 16600, loss[loss=0.3557, simple_loss=0.4205, pruned_loss=0.1455, over 21637.00 frames. ], tot_loss[loss=0.34, simple_loss=0.388, pruned_loss=0.146, over 4282130.04 frames. ], batch size: 263, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 04:25:40,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=99600.0, ans=0.125 2023-06-18 04:25:55,924 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.91 vs. 
limit=15.0 2023-06-18 04:26:20,452 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.60 vs. limit=15.0 2023-06-18 04:26:21,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=99720.0, ans=0.2 2023-06-18 04:26:45,520 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 3.572e+02 4.785e+02 5.858e+02 1.029e+03, threshold=9.570e+02, percent-clipped=5.0 2023-06-18 04:27:29,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=99840.0, ans=0.07 2023-06-18 04:27:34,747 INFO [train.py:996] (0/4) Epoch 1, batch 16650, loss[loss=0.3494, simple_loss=0.3951, pruned_loss=0.1519, over 21786.00 frames. ], tot_loss[loss=0.3471, simple_loss=0.3983, pruned_loss=0.148, over 4274895.11 frames. ], batch size: 298, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 04:28:13,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=99960.0, ans=0.0 2023-06-18 04:29:26,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=100080.0, ans=0.0 2023-06-18 04:29:28,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=100140.0, ans=0.1 2023-06-18 04:29:35,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=100140.0, ans=0.125 2023-06-18 04:29:35,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=100140.0, ans=0.2 2023-06-18 04:29:35,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=100140.0, ans=0.0 2023-06-18 04:29:51,754 INFO [train.py:996] (0/4) Epoch 1, batch 16700, loss[loss=0.3334, simple_loss=0.396, pruned_loss=0.1354, over 21659.00 frames. ], tot_loss[loss=0.3462, simple_loss=0.3971, pruned_loss=0.1477, over 4274725.23 frames. ], batch size: 414, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 04:30:18,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=100200.0, ans=0.2 2023-06-18 04:30:55,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=100320.0, ans=0.125 2023-06-18 04:31:12,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.672e+02 3.595e+02 4.250e+02 5.162e+02 1.058e+03, threshold=8.499e+02, percent-clipped=1.0 2023-06-18 04:32:38,940 INFO [train.py:996] (0/4) Epoch 1, batch 16750, loss[loss=0.4939, simple_loss=0.5218, pruned_loss=0.2329, over 21404.00 frames. ], tot_loss[loss=0.3506, simple_loss=0.3999, pruned_loss=0.1507, over 4271944.92 frames. 
], batch size: 507, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 04:34:21,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=100680.0, ans=0.125 2023-06-18 04:34:23,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=100680.0, ans=0.04949747468305833 2023-06-18 04:34:25,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=100680.0, ans=0.125 2023-06-18 04:34:29,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.07 vs. limit=15.0 2023-06-18 04:35:13,753 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-18 04:35:18,799 INFO [train.py:996] (0/4) Epoch 1, batch 16800, loss[loss=0.219, simple_loss=0.2407, pruned_loss=0.09866, over 16800.00 frames. ], tot_loss[loss=0.3519, simple_loss=0.4026, pruned_loss=0.1505, over 4268988.65 frames. ], batch size: 61, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 04:35:34,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=100860.0, ans=0.1 2023-06-18 04:35:39,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=100860.0, ans=0.125 2023-06-18 04:35:43,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=100860.0, ans=0.1 2023-06-18 04:36:05,522 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.666e+02 3.547e+02 4.070e+02 4.993e+02 8.656e+02, threshold=8.140e+02, percent-clipped=1.0 2023-06-18 04:36:07,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0 2023-06-18 04:36:28,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=100980.0, ans=0.0 2023-06-18 04:37:02,095 INFO [train.py:996] (0/4) Epoch 1, batch 16850, loss[loss=0.3467, simple_loss=0.391, pruned_loss=0.1512, over 21906.00 frames. ], tot_loss[loss=0.3504, simple_loss=0.3996, pruned_loss=0.1506, over 4279494.66 frames. ], batch size: 118, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 04:39:12,672 INFO [train.py:996] (0/4) Epoch 1, batch 16900, loss[loss=0.3263, simple_loss=0.3649, pruned_loss=0.1439, over 21177.00 frames. ], tot_loss[loss=0.3438, simple_loss=0.3921, pruned_loss=0.1477, over 4285179.99 frames. ], batch size: 607, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 04:39:32,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. 
limit=10.0 2023-06-18 04:40:13,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=101520.0, ans=0.025 2023-06-18 04:40:16,076 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.290e+02 3.976e+02 5.128e+02 6.971e+02, threshold=7.952e+02, percent-clipped=0.0 2023-06-18 04:40:19,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=101580.0, ans=0.125 2023-06-18 04:41:08,121 INFO [train.py:996] (0/4) Epoch 1, batch 16950, loss[loss=0.3789, simple_loss=0.4002, pruned_loss=0.1788, over 21597.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.3846, pruned_loss=0.1454, over 4287081.08 frames. ], batch size: 471, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 04:41:08,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=101700.0, ans=0.0 2023-06-18 04:41:42,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.58 vs. limit=15.0 2023-06-18 04:43:28,696 INFO [train.py:996] (0/4) Epoch 1, batch 17000, loss[loss=0.3251, simple_loss=0.3653, pruned_loss=0.1425, over 21857.00 frames. ], tot_loss[loss=0.3362, simple_loss=0.3809, pruned_loss=0.1457, over 4287059.63 frames. ], batch size: 298, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 04:43:35,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-18 04:43:45,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.14 vs. limit=22.5 2023-06-18 04:44:39,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 3.298e+02 3.924e+02 4.876e+02 1.271e+03, threshold=7.848e+02, percent-clipped=1.0 2023-06-18 04:45:43,360 INFO [train.py:996] (0/4) Epoch 1, batch 17050, loss[loss=0.4288, simple_loss=0.4664, pruned_loss=0.1955, over 21549.00 frames. ], tot_loss[loss=0.3439, simple_loss=0.3885, pruned_loss=0.1496, over 4290459.34 frames. ], batch size: 471, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 04:45:54,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0 2023-06-18 04:47:43,114 INFO [train.py:996] (0/4) Epoch 1, batch 17100, loss[loss=0.3451, simple_loss=0.3762, pruned_loss=0.157, over 21932.00 frames. ], tot_loss[loss=0.3426, simple_loss=0.387, pruned_loss=0.1491, over 4296080.40 frames. ], batch size: 333, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 04:48:24,967 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.96 vs. 
limit=15.0 2023-06-18 04:48:28,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=102660.0, ans=10.0 2023-06-18 04:48:31,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=102660.0, ans=0.2 2023-06-18 04:49:06,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.244e+02 3.266e+02 3.852e+02 4.929e+02 1.111e+03, threshold=7.703e+02, percent-clipped=4.0 2023-06-18 04:49:33,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=102780.0, ans=0.0 2023-06-18 04:49:55,671 INFO [train.py:996] (0/4) Epoch 1, batch 17150, loss[loss=0.3761, simple_loss=0.3968, pruned_loss=0.1777, over 21719.00 frames. ], tot_loss[loss=0.3394, simple_loss=0.3823, pruned_loss=0.1483, over 4300030.36 frames. ], batch size: 508, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 04:50:12,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=102960.0, ans=0.125 2023-06-18 04:50:39,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=102960.0, ans=0.025 2023-06-18 04:51:49,914 INFO [train.py:996] (0/4) Epoch 1, batch 17200, loss[loss=0.3711, simple_loss=0.4034, pruned_loss=0.1694, over 19998.00 frames. ], tot_loss[loss=0.3396, simple_loss=0.383, pruned_loss=0.1481, over 4301193.55 frames. ], batch size: 702, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 04:51:56,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=103200.0, ans=0.1 2023-06-18 04:52:27,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=103260.0, ans=0.0 2023-06-18 04:52:28,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103260.0, ans=0.1 2023-06-18 04:52:36,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=103260.0, ans=0.1 2023-06-18 04:53:02,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=103320.0, ans=0.0 2023-06-18 04:53:06,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 3.334e+02 4.026e+02 5.110e+02 1.056e+03, threshold=8.051e+02, percent-clipped=6.0 2023-06-18 04:54:01,781 INFO [train.py:996] (0/4) Epoch 1, batch 17250, loss[loss=0.3307, simple_loss=0.3857, pruned_loss=0.1379, over 21640.00 frames. ], tot_loss[loss=0.3445, simple_loss=0.3877, pruned_loss=0.1506, over 4301164.11 frames. 
], batch size: 263, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 04:54:06,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=103500.0, ans=0.125 2023-06-18 04:54:27,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=103500.0, ans=0.2 2023-06-18 04:55:25,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=103620.0, ans=0.0 2023-06-18 04:55:58,634 INFO [train.py:996] (0/4) Epoch 1, batch 17300, loss[loss=0.3482, simple_loss=0.3945, pruned_loss=0.1509, over 21785.00 frames. ], tot_loss[loss=0.354, simple_loss=0.3976, pruned_loss=0.1551, over 4293310.13 frames. ], batch size: 282, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 04:56:11,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=103800.0, ans=0.0 2023-06-18 04:56:30,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=103860.0, ans=0.04949747468305833 2023-06-18 04:57:28,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 3.957e+02 4.761e+02 5.817e+02 1.132e+03, threshold=9.521e+02, percent-clipped=5.0 2023-06-18 04:58:03,575 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0 2023-06-18 04:58:39,409 INFO [train.py:996] (0/4) Epoch 1, batch 17350, loss[loss=0.3744, simple_loss=0.4264, pruned_loss=0.1612, over 20636.00 frames. ], tot_loss[loss=0.3542, simple_loss=0.3988, pruned_loss=0.1549, over 4289333.46 frames. ], batch size: 607, lr: 2.83e-02, grad_scale: 32.0 2023-06-18 04:59:06,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=104160.0, ans=0.2 2023-06-18 04:59:17,350 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-18 04:59:31,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=104220.0, ans=0.125 2023-06-18 04:59:45,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=104280.0, ans=0.0 2023-06-18 05:00:36,714 INFO [train.py:996] (0/4) Epoch 1, batch 17400, loss[loss=0.2043, simple_loss=0.2255, pruned_loss=0.09153, over 16359.00 frames. ], tot_loss[loss=0.3457, simple_loss=0.3933, pruned_loss=0.1491, over 4280000.07 frames. ], batch size: 60, lr: 2.83e-02, grad_scale: 32.0 2023-06-18 05:02:13,469 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.697e+02 4.904e+02 6.158e+02 8.783e+02, threshold=9.807e+02, percent-clipped=0.0 2023-06-18 05:03:10,194 INFO [train.py:996] (0/4) Epoch 1, batch 17450, loss[loss=0.238, simple_loss=0.3189, pruned_loss=0.07855, over 21808.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3862, pruned_loss=0.1441, over 4275038.15 frames. 
], batch size: 316, lr: 2.83e-02, grad_scale: 32.0 2023-06-18 05:04:01,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=104760.0, ans=0.125 2023-06-18 05:04:11,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=104820.0, ans=0.125 2023-06-18 05:04:15,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=104820.0, ans=0.125 2023-06-18 05:04:46,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=104880.0, ans=0.125 2023-06-18 05:04:53,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=104880.0, ans=0.125 2023-06-18 05:05:20,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-18 05:05:29,179 INFO [train.py:996] (0/4) Epoch 1, batch 17500, loss[loss=0.2924, simple_loss=0.3418, pruned_loss=0.1215, over 21864.00 frames. ], tot_loss[loss=0.3269, simple_loss=0.3781, pruned_loss=0.1379, over 4282811.77 frames. ], batch size: 124, lr: 2.82e-02, grad_scale: 64.0 2023-06-18 05:05:51,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105060.0, ans=0.1 2023-06-18 05:06:08,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=105120.0, ans=0.2 2023-06-18 05:06:23,242 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.676e+02 3.184e+02 3.972e+02 6.733e+02, threshold=6.368e+02, percent-clipped=0.0 2023-06-18 05:07:10,587 INFO [train.py:996] (0/4) Epoch 1, batch 17550, loss[loss=0.3036, simple_loss=0.3683, pruned_loss=0.1195, over 21261.00 frames. ], tot_loss[loss=0.3262, simple_loss=0.3786, pruned_loss=0.1369, over 4288802.84 frames. ], batch size: 143, lr: 2.82e-02, grad_scale: 32.0 2023-06-18 05:07:16,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=105300.0, ans=0.125 2023-06-18 05:07:25,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=105360.0, ans=0.125 2023-06-18 05:09:00,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=105540.0, ans=0.125 2023-06-18 05:09:08,737 INFO [train.py:996] (0/4) Epoch 1, batch 17600, loss[loss=0.3747, simple_loss=0.4103, pruned_loss=0.1696, over 21693.00 frames. ], tot_loss[loss=0.326, simple_loss=0.38, pruned_loss=0.1361, over 4278305.57 frames. ], batch size: 351, lr: 2.82e-02, grad_scale: 32.0 2023-06-18 05:10:23,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 3.173e+02 4.298e+02 5.550e+02 1.174e+03, threshold=8.596e+02, percent-clipped=15.0 2023-06-18 05:10:37,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. 
limit=15.0 2023-06-18 05:10:58,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=105840.0, ans=0.125 2023-06-18 05:11:05,585 INFO [train.py:996] (0/4) Epoch 1, batch 17650, loss[loss=0.2702, simple_loss=0.3274, pruned_loss=0.1065, over 21839.00 frames. ], tot_loss[loss=0.3235, simple_loss=0.3761, pruned_loss=0.1354, over 4279194.78 frames. ], batch size: 317, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 05:11:16,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=105900.0, ans=0.07 2023-06-18 05:11:18,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=105900.0, ans=0.125 2023-06-18 05:11:21,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105960.0, ans=0.1 2023-06-18 05:11:26,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.43 vs. limit=22.5 2023-06-18 05:11:51,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=106020.0, ans=0.125 2023-06-18 05:12:09,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=106080.0, ans=0.2 2023-06-18 05:12:42,040 INFO [train.py:996] (0/4) Epoch 1, batch 17700, loss[loss=0.4423, simple_loss=0.476, pruned_loss=0.2043, over 21437.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.3714, pruned_loss=0.1319, over 4270891.38 frames. ], batch size: 471, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 05:12:58,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=106200.0, ans=0.09899494936611666 2023-06-18 05:14:06,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 3.144e+02 3.611e+02 4.728e+02 8.023e+02, threshold=7.222e+02, percent-clipped=0.0 2023-06-18 05:14:08,105 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-18 05:14:44,164 INFO [train.py:996] (0/4) Epoch 1, batch 17750, loss[loss=0.3494, simple_loss=0.3999, pruned_loss=0.1495, over 20644.00 frames. ], tot_loss[loss=0.3292, simple_loss=0.3815, pruned_loss=0.1384, over 4271935.08 frames. ], batch size: 607, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 05:15:17,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=106500.0, ans=0.125 2023-06-18 05:15:22,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=22.5 2023-06-18 05:16:14,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=106680.0, ans=0.125 2023-06-18 05:16:41,541 INFO [train.py:996] (0/4) Epoch 1, batch 17800, loss[loss=0.3531, simple_loss=0.3993, pruned_loss=0.1534, over 21292.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.3831, pruned_loss=0.1396, over 4276053.25 frames. 
], batch size: 159, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 05:16:48,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=106800.0, ans=0.09899494936611666 2023-06-18 05:17:32,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=106860.0, ans=0.0 2023-06-18 05:18:11,430 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.015e+02 3.874e+02 4.676e+02 8.507e+02, threshold=7.748e+02, percent-clipped=1.0 2023-06-18 05:18:53,249 INFO [train.py:996] (0/4) Epoch 1, batch 17850, loss[loss=0.3128, simple_loss=0.3625, pruned_loss=0.1316, over 21325.00 frames. ], tot_loss[loss=0.33, simple_loss=0.3828, pruned_loss=0.1386, over 4267054.26 frames. ], batch size: 176, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 05:19:37,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107160.0, ans=0.1 2023-06-18 05:19:38,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=107160.0, ans=0.125 2023-06-18 05:20:40,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=22.5 2023-06-18 05:21:11,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=107340.0, ans=0.125 2023-06-18 05:21:24,299 INFO [train.py:996] (0/4) Epoch 1, batch 17900, loss[loss=0.3201, simple_loss=0.395, pruned_loss=0.1226, over 21631.00 frames. ], tot_loss[loss=0.3374, simple_loss=0.3891, pruned_loss=0.1428, over 4256123.89 frames. ], batch size: 230, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 05:22:43,628 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 3.224e+02 3.725e+02 5.130e+02 9.496e+02, threshold=7.451e+02, percent-clipped=4.0 2023-06-18 05:23:25,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=107640.0, ans=0.125 2023-06-18 05:23:50,223 INFO [train.py:996] (0/4) Epoch 1, batch 17950, loss[loss=0.2853, simple_loss=0.3571, pruned_loss=0.1068, over 21777.00 frames. ], tot_loss[loss=0.3305, simple_loss=0.3863, pruned_loss=0.1374, over 4258524.13 frames. ], batch size: 282, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 05:25:17,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=107880.0, ans=0.125 2023-06-18 05:25:51,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-18 05:25:58,835 INFO [train.py:996] (0/4) Epoch 1, batch 18000, loss[loss=0.2826, simple_loss=0.3269, pruned_loss=0.1191, over 21609.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3772, pruned_loss=0.135, over 4260484.58 frames. ], batch size: 332, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 05:25:58,836 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 05:26:54,355 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3106, simple_loss=0.4066, pruned_loss=0.1073, over 1796401.00 frames. 
2023-06-18 05:26:54,357 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-18 05:27:15,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=108060.0, ans=0.125 2023-06-18 05:27:18,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=108060.0, ans=0.125 2023-06-18 05:27:32,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108120.0, ans=0.1 2023-06-18 05:27:47,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 3.259e+02 3.858e+02 4.507e+02 8.062e+02, threshold=7.716e+02, percent-clipped=1.0 2023-06-18 05:28:09,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108240.0, ans=0.1 2023-06-18 05:28:10,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=108240.0, ans=0.2 2023-06-18 05:28:30,788 INFO [train.py:996] (0/4) Epoch 1, batch 18050, loss[loss=0.3128, simple_loss=0.351, pruned_loss=0.1373, over 21405.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3708, pruned_loss=0.134, over 4268867.99 frames. ], batch size: 194, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 05:28:38,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=108300.0, ans=0.0 2023-06-18 05:29:46,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=108420.0, ans=0.04949747468305833 2023-06-18 05:29:56,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=108480.0, ans=0.125 2023-06-18 05:30:34,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=108540.0, ans=0.125 2023-06-18 05:30:51,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=108600.0, ans=0.0 2023-06-18 05:30:52,495 INFO [train.py:996] (0/4) Epoch 1, batch 18100, loss[loss=0.3858, simple_loss=0.4236, pruned_loss=0.174, over 21383.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3787, pruned_loss=0.1385, over 4265895.15 frames. ], batch size: 549, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 05:30:58,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=108600.0, ans=0.2 2023-06-18 05:31:17,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=108660.0, ans=0.0 2023-06-18 05:31:52,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2023-06-18 05:32:05,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. 
limit=15.0 2023-06-18 05:32:17,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 3.257e+02 4.345e+02 5.048e+02 8.084e+02, threshold=8.690e+02, percent-clipped=1.0 2023-06-18 05:32:17,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=108780.0, ans=0.2 2023-06-18 05:32:19,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.65 vs. limit=15.0 2023-06-18 05:32:27,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108780.0, ans=0.1 2023-06-18 05:32:29,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=108780.0, ans=0.2 2023-06-18 05:32:53,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=108840.0, ans=0.2 2023-06-18 05:32:53,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=108840.0, ans=0.125 2023-06-18 05:32:54,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108840.0, ans=0.1 2023-06-18 05:33:06,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=108840.0, ans=0.2 2023-06-18 05:33:28,588 INFO [train.py:996] (0/4) Epoch 1, batch 18150, loss[loss=0.4365, simple_loss=0.4878, pruned_loss=0.1926, over 19888.00 frames. ], tot_loss[loss=0.327, simple_loss=0.3796, pruned_loss=0.1372, over 4265561.91 frames. ], batch size: 702, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 05:33:28,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=108900.0, ans=0.125 2023-06-18 05:33:32,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=108900.0, ans=0.2 2023-06-18 05:34:02,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108960.0, ans=0.1 2023-06-18 05:34:04,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=108960.0, ans=0.07 2023-06-18 05:34:37,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=109080.0, ans=0.0 2023-06-18 05:34:44,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=109080.0, ans=0.125 2023-06-18 05:35:07,210 INFO [train.py:996] (0/4) Epoch 1, batch 18200, loss[loss=0.2644, simple_loss=0.3231, pruned_loss=0.1029, over 21616.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3738, pruned_loss=0.1372, over 4262359.31 frames. ], batch size: 132, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 05:35:07,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=109200.0, ans=0.0 2023-06-18 05:35:56,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.04 vs. 
limit=6.0 2023-06-18 05:36:12,730 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 3.177e+02 3.790e+02 4.775e+02 7.519e+02, threshold=7.579e+02, percent-clipped=0.0 2023-06-18 05:36:41,783 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-18 05:37:00,791 INFO [train.py:996] (0/4) Epoch 1, batch 18250, loss[loss=0.2531, simple_loss=0.3113, pruned_loss=0.09744, over 21484.00 frames. ], tot_loss[loss=0.3128, simple_loss=0.3623, pruned_loss=0.1316, over 4258361.00 frames. ], batch size: 212, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 05:37:17,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109500.0, ans=0.1 2023-06-18 05:39:24,243 INFO [train.py:996] (0/4) Epoch 1, batch 18300, loss[loss=0.3433, simple_loss=0.417, pruned_loss=0.1348, over 21414.00 frames. ], tot_loss[loss=0.3129, simple_loss=0.3619, pruned_loss=0.1319, over 4246259.12 frames. ], batch size: 211, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 05:39:28,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=109800.0, ans=0.1 2023-06-18 05:41:05,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 3.323e+02 3.972e+02 4.906e+02 9.934e+02, threshold=7.944e+02, percent-clipped=3.0 2023-06-18 05:41:17,068 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:41:20,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=109980.0, ans=0.2 2023-06-18 05:41:26,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=110040.0, ans=0.125 2023-06-18 05:41:43,039 INFO [train.py:996] (0/4) Epoch 1, batch 18350, loss[loss=0.289, simple_loss=0.3503, pruned_loss=0.1138, over 21394.00 frames. ], tot_loss[loss=0.3192, simple_loss=0.371, pruned_loss=0.1336, over 4243381.45 frames. ], batch size: 211, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 05:41:54,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0 2023-06-18 05:44:12,494 INFO [train.py:996] (0/4) Epoch 1, batch 18400, loss[loss=0.3727, simple_loss=0.3968, pruned_loss=0.1743, over 19979.00 frames. ], tot_loss[loss=0.3155, simple_loss=0.3667, pruned_loss=0.1321, over 4239906.16 frames. 
], batch size: 702, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 05:44:20,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=110400.0, ans=0.2 2023-06-18 05:44:27,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=110460.0, ans=0.0 2023-06-18 05:45:02,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110520.0, ans=0.1 2023-06-18 05:45:05,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=110520.0, ans=0.04949747468305833 2023-06-18 05:45:21,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=110520.0, ans=0.0 2023-06-18 05:45:22,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.119e+02 3.679e+02 4.893e+02 6.747e+02, threshold=7.358e+02, percent-clipped=0.0 2023-06-18 05:46:30,254 INFO [train.py:996] (0/4) Epoch 1, batch 18450, loss[loss=0.4059, simple_loss=0.4983, pruned_loss=0.1567, over 19670.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3605, pruned_loss=0.1258, over 4237728.47 frames. ], batch size: 702, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 05:48:26,250 INFO [train.py:996] (0/4) Epoch 1, batch 18500, loss[loss=0.2615, simple_loss=0.3535, pruned_loss=0.08473, over 20811.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3564, pruned_loss=0.1233, over 4248063.29 frames. ], batch size: 608, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 05:48:31,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=111000.0, ans=0.0 2023-06-18 05:48:35,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=15.0 2023-06-18 05:49:02,591 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:49:20,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.596e+02 4.284e+02 6.309e+02 9.887e+02, threshold=8.569e+02, percent-clipped=11.0 2023-06-18 05:49:28,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=111180.0, ans=0.0 2023-06-18 05:49:35,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=111180.0, ans=0.125 2023-06-18 05:50:02,713 INFO [train.py:996] (0/4) Epoch 1, batch 18550, loss[loss=0.3128, simple_loss=0.3864, pruned_loss=0.1196, over 21478.00 frames. ], tot_loss[loss=0.3006, simple_loss=0.3547, pruned_loss=0.1232, over 4248724.35 frames. ], batch size: 471, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 05:50:06,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=111300.0, ans=0.0 2023-06-18 05:50:12,940 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.10 vs. 
limit=15.0 2023-06-18 05:50:39,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=111420.0, ans=0.125 2023-06-18 05:52:01,496 INFO [train.py:996] (0/4) Epoch 1, batch 18600, loss[loss=0.3369, simple_loss=0.3634, pruned_loss=0.1552, over 20136.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3541, pruned_loss=0.125, over 4252415.20 frames. ], batch size: 702, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 05:53:20,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.842e+02 3.411e+02 4.162e+02 7.856e+02, threshold=6.821e+02, percent-clipped=0.0 2023-06-18 05:53:33,305 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:54:02,521 INFO [train.py:996] (0/4) Epoch 1, batch 18650, loss[loss=0.2757, simple_loss=0.3174, pruned_loss=0.117, over 21764.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3524, pruned_loss=0.1246, over 4260847.05 frames. ], batch size: 124, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 05:54:22,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=111960.0, ans=0.0 2023-06-18 05:54:58,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=112080.0, ans=0.0 2023-06-18 05:55:17,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=112080.0, ans=0.125 2023-06-18 05:55:40,745 INFO [train.py:996] (0/4) Epoch 1, batch 18700, loss[loss=0.2867, simple_loss=0.3233, pruned_loss=0.125, over 21429.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3491, pruned_loss=0.1256, over 4257452.37 frames. ], batch size: 194, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 05:57:01,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 3.044e+02 3.629e+02 4.654e+02 7.971e+02, threshold=7.259e+02, percent-clipped=4.0 2023-06-18 05:57:09,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=112380.0, ans=0.1 2023-06-18 05:57:47,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=112500.0, ans=0.02 2023-06-18 05:57:50,358 INFO [train.py:996] (0/4) Epoch 1, batch 18750, loss[loss=0.3095, simple_loss=0.3586, pruned_loss=0.1302, over 21834.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.3511, pruned_loss=0.1286, over 4259685.00 frames. ], batch size: 124, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 05:57:51,314 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-06-18 05:58:48,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=112560.0, ans=0.2 2023-06-18 05:59:11,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=112620.0, ans=0.125 2023-06-18 05:59:11,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-18 06:00:27,312 INFO [train.py:996] (0/4) Epoch 1, batch 18800, loss[loss=0.2378, simple_loss=0.2973, pruned_loss=0.08913, over 21192.00 frames. 
], tot_loss[loss=0.3074, simple_loss=0.356, pruned_loss=0.1293, over 4254410.73 frames. ], batch size: 143, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 06:00:30,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=112800.0, ans=0.125 2023-06-18 06:01:57,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=112920.0, ans=0.0 2023-06-18 06:02:08,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.633e+02 3.299e+02 4.034e+02 4.833e+02 8.926e+02, threshold=8.067e+02, percent-clipped=4.0 2023-06-18 06:02:18,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=112980.0, ans=0.2 2023-06-18 06:02:49,581 INFO [train.py:996] (0/4) Epoch 1, batch 18850, loss[loss=0.2909, simple_loss=0.3387, pruned_loss=0.1216, over 21694.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3513, pruned_loss=0.1231, over 4265263.67 frames. ], batch size: 333, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 06:02:51,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=113100.0, ans=0.125 2023-06-18 06:03:01,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=113100.0, ans=0.0 2023-06-18 06:04:27,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.74 vs. limit=22.5 2023-06-18 06:04:35,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=113280.0, ans=0.95 2023-06-18 06:04:56,939 INFO [train.py:996] (0/4) Epoch 1, batch 18900, loss[loss=0.3005, simple_loss=0.3471, pruned_loss=0.1269, over 21729.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3489, pruned_loss=0.1241, over 4266284.74 frames. ], batch size: 112, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 06:05:04,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=113400.0, ans=0.0 2023-06-18 06:06:05,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-18 06:06:31,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=113520.0, ans=0.025 2023-06-18 06:06:33,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=113520.0, ans=0.125 2023-06-18 06:06:38,851 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 3.081e+02 3.640e+02 4.781e+02 9.031e+02, threshold=7.280e+02, percent-clipped=2.0 2023-06-18 06:07:40,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.27 vs. 
limit=22.5 2023-06-18 06:07:41,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=113640.0, ans=0.2 2023-06-18 06:07:42,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=113700.0, ans=0.0 2023-06-18 06:07:43,554 INFO [train.py:996] (0/4) Epoch 1, batch 18950, loss[loss=0.3482, simple_loss=0.4029, pruned_loss=0.1468, over 21812.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3526, pruned_loss=0.1286, over 4270793.09 frames. ], batch size: 351, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 06:07:45,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=113700.0, ans=0.2 2023-06-18 06:08:41,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=113760.0, ans=0.05 2023-06-18 06:08:46,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=113760.0, ans=0.025 2023-06-18 06:09:29,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=113880.0, ans=0.0 2023-06-18 06:10:14,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113940.0, ans=0.1 2023-06-18 06:10:26,259 INFO [train.py:996] (0/4) Epoch 1, batch 19000, loss[loss=0.3655, simple_loss=0.436, pruned_loss=0.1475, over 21504.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3634, pruned_loss=0.1307, over 4272934.24 frames. ], batch size: 473, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 06:10:54,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=114060.0, ans=0.0 2023-06-18 06:11:11,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=114120.0, ans=0.125 2023-06-18 06:11:12,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0 2023-06-18 06:11:28,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.443e+02 4.158e+02 4.977e+02 1.551e+03, threshold=8.315e+02, percent-clipped=7.0 2023-06-18 06:11:52,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=114180.0, ans=0.125 2023-06-18 06:12:20,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=15.0 2023-06-18 06:12:22,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=114240.0, ans=0.05 2023-06-18 06:12:32,963 INFO [train.py:996] (0/4) Epoch 1, batch 19050, loss[loss=0.2982, simple_loss=0.3423, pruned_loss=0.1271, over 21304.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.3701, pruned_loss=0.1364, over 4279673.15 frames. ], batch size: 159, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 06:12:52,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.66 vs. 
limit=15.0 2023-06-18 06:13:14,374 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.50 vs. limit=15.0 2023-06-18 06:15:15,484 INFO [train.py:996] (0/4) Epoch 1, batch 19100, loss[loss=0.3033, simple_loss=0.3428, pruned_loss=0.1319, over 21727.00 frames. ], tot_loss[loss=0.3227, simple_loss=0.3688, pruned_loss=0.1383, over 4282187.47 frames. ], batch size: 316, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 06:16:03,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2023-06-18 06:16:32,256 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.540e+02 3.357e+02 4.030e+02 5.006e+02 7.474e+02, threshold=8.060e+02, percent-clipped=0.0 2023-06-18 06:17:39,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-18 06:18:01,642 INFO [train.py:996] (0/4) Epoch 1, batch 19150, loss[loss=0.344, simple_loss=0.4155, pruned_loss=0.1362, over 21787.00 frames. ], tot_loss[loss=0.3262, simple_loss=0.3726, pruned_loss=0.1399, over 4285681.04 frames. ], batch size: 282, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 06:18:14,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=114900.0, ans=0.0 2023-06-18 06:18:54,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115020.0, ans=0.1 2023-06-18 06:20:21,632 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-18 06:20:23,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=115140.0, ans=0.125 2023-06-18 06:20:38,012 INFO [train.py:996] (0/4) Epoch 1, batch 19200, loss[loss=0.3871, simple_loss=0.4052, pruned_loss=0.1845, over 20058.00 frames. ], tot_loss[loss=0.3318, simple_loss=0.3822, pruned_loss=0.1407, over 4275459.71 frames. ], batch size: 702, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 06:21:49,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 3.092e+02 3.837e+02 4.671e+02 8.670e+02, threshold=7.675e+02, percent-clipped=1.0 2023-06-18 06:22:37,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=115440.0, ans=0.2 2023-06-18 06:23:03,907 INFO [train.py:996] (0/4) Epoch 1, batch 19250, loss[loss=0.2735, simple_loss=0.3529, pruned_loss=0.09709, over 21660.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3808, pruned_loss=0.1329, over 4272423.55 frames. ], batch size: 441, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 06:23:07,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=115500.0, ans=0.07 2023-06-18 06:23:08,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.27 vs. 
limit=15.0 2023-06-18 06:25:40,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=115800.0, ans=0.95 2023-06-18 06:25:41,822 INFO [train.py:996] (0/4) Epoch 1, batch 19300, loss[loss=0.3697, simple_loss=0.4049, pruned_loss=0.1673, over 21611.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3783, pruned_loss=0.1339, over 4276360.16 frames. ], batch size: 508, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 06:25:42,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=115800.0, ans=0.0 2023-06-18 06:25:53,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=115800.0, ans=0.125 2023-06-18 06:26:02,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=115800.0, ans=0.07 2023-06-18 06:26:05,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115800.0, ans=0.1 2023-06-18 06:26:13,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=115860.0, ans=0.125 2023-06-18 06:26:14,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.77 vs. limit=22.5 2023-06-18 06:27:11,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=115920.0, ans=0.0 2023-06-18 06:27:15,165 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.985e+02 3.697e+02 4.596e+02 6.937e+02, threshold=7.395e+02, percent-clipped=0.0 2023-06-18 06:27:29,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=115980.0, ans=0.0 2023-06-18 06:27:32,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.00 vs. limit=10.0 2023-06-18 06:27:42,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=116040.0, ans=0.02 2023-06-18 06:28:28,633 INFO [train.py:996] (0/4) Epoch 1, batch 19350, loss[loss=0.2401, simple_loss=0.2934, pruned_loss=0.09339, over 21838.00 frames. ], tot_loss[loss=0.3122, simple_loss=0.3693, pruned_loss=0.1275, over 4276606.85 frames. ], batch size: 107, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 06:29:36,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=12.0 2023-06-18 06:29:38,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=116220.0, ans=0.2 2023-06-18 06:29:49,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.18 vs. 
2023-06-18 06:29:57,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=116280.0, ans=0.125
2023-06-18 06:30:00,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=116280.0, ans=0.2
2023-06-18 06:30:59,107 INFO [train.py:996] (0/4) Epoch 1, batch 19400, loss[loss=0.2537, simple_loss=0.3103, pruned_loss=0.09856, over 21202.00 frames. ], tot_loss[loss=0.3087, simple_loss=0.3663, pruned_loss=0.1255, over 4272998.54 frames. ], batch size: 159, lr: 2.70e-02, grad_scale: 32.0
2023-06-18 06:31:20,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=116400.0, ans=0.1
2023-06-18 06:31:22,913 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0
2023-06-18 06:32:03,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=116520.0, ans=0.04949747468305833
2023-06-18 06:32:21,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.419e+02 3.848e+02 4.708e+02 7.710e+02, threshold=7.695e+02, percent-clipped=3.0
2023-06-18 06:32:34,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=116580.0, ans=0.5
2023-06-18 06:33:32,546 INFO [train.py:996] (0/4) Epoch 1, batch 19450, loss[loss=0.2981, simple_loss=0.3412, pruned_loss=0.1275, over 21795.00 frames. ], tot_loss[loss=0.312, simple_loss=0.3653, pruned_loss=0.1294, over 4278609.91 frames. ], batch size: 351, lr: 2.70e-02, grad_scale: 32.0
2023-06-18 06:34:50,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=116880.0, ans=15.0
2023-06-18 06:35:49,144 INFO [train.py:996] (0/4) Epoch 1, batch 19500, loss[loss=0.2684, simple_loss=0.3158, pruned_loss=0.1105, over 21201.00 frames. ], tot_loss[loss=0.3113, simple_loss=0.3605, pruned_loss=0.131, over 4276777.90 frames. ], batch size: 176, lr: 2.70e-02, grad_scale: 32.0
2023-06-18 06:37:30,036 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.181e+02 3.762e+02 4.814e+02 7.838e+02, threshold=7.523e+02, percent-clipped=1.0
2023-06-18 06:37:33,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=117180.0, ans=0.0
2023-06-18 06:37:48,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0
2023-06-18 06:38:40,444 INFO [train.py:996] (0/4) Epoch 1, batch 19550, loss[loss=0.2767, simple_loss=0.3521, pruned_loss=0.1007, over 21569.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3547, pruned_loss=0.1272, over 4273925.55 frames. ], batch size: 230, lr: 2.69e-02, grad_scale: 64.0
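
The optim.py entries report the running quartiles of the gradient norm (min, 25%, median, 75%, max) along with the current clipping threshold and the percentage of recent batches whose gradients were clipped. The logged numbers are consistent with the threshold tracking Clipping_scale times the median: above, threshold=7.695e+02 is exactly 2.0 x 3.848e+02. A hedged sketch of that idea; the window size and the exact statistic used by the real optimizer are assumptions:

    import torch
    from collections import deque

    class AdaptiveGradClipper:
        # Clip gradients to clipping_scale times the recent median grad-norm.

        def __init__(self, clipping_scale: float = 2.0, window: int = 1000):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)

        def __call__(self, parameters) -> bool:
            params = [p for p in parameters if p.grad is not None]
            if not params:
                return False
            norm = torch.norm(
                torch.stack([p.grad.detach().norm() for p in params])
            ).item()
            self.norms.append(norm)
            median = sorted(self.norms)[len(self.norms) // 2]
            threshold = self.clipping_scale * median
            if norm > threshold:
                for p in params:
                    p.grad.mul_(threshold / norm)
                return True   # aggregate over a window to get percent-clipped
            return False
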
2023-06-18 06:38:46,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=117300.0, ans=0.1
2023-06-18 06:38:54,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=117300.0, ans=0.125
2023-06-18 06:38:58,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=117360.0, ans=0.125
2023-06-18 06:39:22,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=117360.0, ans=0.125
2023-06-18 06:41:01,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=117600.0, ans=0.2
2023-06-18 06:41:02,441 INFO [train.py:996] (0/4) Epoch 1, batch 19600, loss[loss=0.338, simple_loss=0.3819, pruned_loss=0.147, over 21876.00 frames. ], tot_loss[loss=0.3097, simple_loss=0.3583, pruned_loss=0.1306, over 4285163.00 frames. ], batch size: 107, lr: 2.69e-02, grad_scale: 64.0
2023-06-18 06:41:56,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=117660.0, ans=0.0
2023-06-18 06:41:58,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=8.0
2023-06-18 06:42:30,568 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 3.158e+02 3.823e+02 5.330e+02 9.735e+02, threshold=7.645e+02, percent-clipped=7.0
2023-06-18 06:42:38,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=117780.0, ans=0.0
2023-06-18 06:43:45,968 INFO [train.py:996] (0/4) Epoch 1, batch 19650, loss[loss=0.3317, simple_loss=0.3707, pruned_loss=0.1463, over 21859.00 frames. ], tot_loss[loss=0.3211, simple_loss=0.3671, pruned_loss=0.1376, over 4288659.72 frames. ], batch size: 282, lr: 2.69e-02, grad_scale: 32.0
2023-06-18 06:44:07,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0
2023-06-18 06:44:46,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117960.0, ans=0.1
2023-06-18 06:44:46,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=117960.0, ans=0.2
2023-06-18 06:45:05,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=118020.0, ans=0.125
2023-06-18 06:45:59,175 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 06:46:15,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=118140.0, ans=0.0
2023-06-18 06:46:47,850 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 06:46:50,132 INFO [train.py:996] (0/4) Epoch 1, batch 19700, loss[loss=0.3005, simple_loss=0.3718, pruned_loss=0.1146, over 21784.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3711, pruned_loss=0.1383, over 4281498.42 frames. ], batch size: 316, lr: 2.68e-02, grad_scale: 32.0
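
Each train entry prints simple_loss and pruned_loss alongside their combination. Throughout this stretch of the log the combined value equals 0.5 * simple_loss + pruned_loss (e.g. for batch 19600 above: 0.5 * 0.3583 + 0.1306 = 0.3098, matching the logged 0.3097 up to rounding), consistent with how pruned-transducer recipes typically down-weight the smoothed "simple" loss against the pruned RNN-T loss. A quick check, with the 0.5 weight read off the log rather than from the code:

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        # The simple (linear-boundary) transducer loss is down-weighted
        # relative to the pruned RNN-T loss.
        return simple_loss_scale * simple_loss + pruned_loss

    assert abs(combined_loss(0.3583, 0.1306) - 0.3097) < 1e-3
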
2023-06-18 06:46:50,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=118200.0, ans=0.0
2023-06-18 06:47:24,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=118260.0, ans=0.07
2023-06-18 06:47:42,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=118320.0, ans=0.125
2023-06-18 06:48:00,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=118320.0, ans=0.125
2023-06-18 06:48:02,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=118320.0, ans=0.2
2023-06-18 06:48:30,004 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.207e+02 3.429e+02 4.231e+02 5.435e+02 1.062e+03, threshold=8.463e+02, percent-clipped=10.0
2023-06-18 06:49:32,821 INFO [train.py:996] (0/4) Epoch 1, batch 19750, loss[loss=0.3224, simple_loss=0.3844, pruned_loss=0.1302, over 21486.00 frames. ], tot_loss[loss=0.3296, simple_loss=0.38, pruned_loss=0.1396, over 4274977.53 frames. ], batch size: 211, lr: 2.68e-02, grad_scale: 32.0
2023-06-18 06:49:48,719 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 06:50:07,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=118560.0, ans=0.0
2023-06-18 06:50:19,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.78 vs. limit=6.0
2023-06-18 06:50:55,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0
2023-06-18 06:51:03,698 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0
2023-06-18 06:51:58,220 INFO [train.py:996] (0/4) Epoch 1, batch 19800, loss[loss=0.2563, simple_loss=0.3051, pruned_loss=0.1038, over 21280.00 frames. ], tot_loss[loss=0.3313, simple_loss=0.3806, pruned_loss=0.141, over 4274683.55 frames. ], batch size: 176, lr: 2.68e-02, grad_scale: 32.0
2023-06-18 06:52:25,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0
2023-06-18 06:53:30,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=118920.0, ans=0.125
2023-06-18 06:53:46,853 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 3.372e+02 3.953e+02 4.969e+02 1.016e+03, threshold=7.905e+02, percent-clipped=3.0
2023-06-18 06:54:10,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.92 vs. limit=15.0
2023-06-18 06:54:39,912 INFO [train.py:996] (0/4) Epoch 1, batch 19850, loss[loss=0.2513, simple_loss=0.3271, pruned_loss=0.08769, over 21710.00 frames. ], tot_loss[loss=0.3169, simple_loss=0.3691, pruned_loss=0.1323, over 4271238.50 frames. ], batch size: 332, lr: 2.68e-02, grad_scale: 32.0
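
The WithLoss entries show an auxiliary loss attached directly to attention-weight tensors (here summing to 0.000e+00, i.e. the penalty is currently inactive). One way to attach such a penalty without changing a module's output is an identity autograd function that injects the penalty's gradient during backward; a sketch of that mechanism with a made-up quadratic penalty, not the scaling.py implementation itself:

    import torch

    class WithAuxLoss(torch.autograd.Function):
        # Identity in forward; backward adds the gradient of an auxiliary
        # penalty on the tensor itself to the incoming gradient.

        @staticmethod
        def forward(ctx, x: torch.Tensor, scale: float) -> torch.Tensor:
            ctx.save_for_backward(x)
            ctx.scale = scale
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out: torch.Tensor):
            (x,) = ctx.saved_tensors
            with torch.enable_grad():
                xd = x.detach().requires_grad_(True)
                penalty = ctx.scale * (xd ** 2).sum()   # hypothetical penalty
                (g,) = torch.autograd.grad(penalty, xd)
            return grad_out + g, None

    attn_weights = torch.rand(4, 8, 10, requires_grad=True)
    out = WithAuxLoss.apply(attn_weights, 1e-4)   # behaves like attn_weights
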
2023-06-18 06:54:46,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=15.0
2023-06-18 06:55:29,442 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 06:55:43,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=119220.0, ans=0.125
2023-06-18 06:55:59,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=119280.0, ans=0.125
2023-06-18 06:57:16,284 INFO [train.py:996] (0/4) Epoch 1, batch 19900, loss[loss=0.292, simple_loss=0.3504, pruned_loss=0.1168, over 21761.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3687, pruned_loss=0.1287, over 4269543.94 frames. ], batch size: 316, lr: 2.67e-02, grad_scale: 32.0
2023-06-18 06:57:34,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=119400.0, ans=0.0
2023-06-18 06:58:40,850 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 3.179e+02 3.927e+02 4.733e+02 6.841e+02, threshold=7.854e+02, percent-clipped=0.0
2023-06-18 06:58:53,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=119580.0, ans=0.0
2023-06-18 06:59:35,268 INFO [train.py:996] (0/4) Epoch 1, batch 19950, loss[loss=0.2849, simple_loss=0.3353, pruned_loss=0.1173, over 21404.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3625, pruned_loss=0.1289, over 4261663.16 frames. ], batch size: 194, lr: 2.67e-02, grad_scale: 32.0
2023-06-18 07:01:33,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119880.0, ans=0.1
2023-06-18 07:01:52,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=22.5
2023-06-18 07:02:19,274 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-20000.pt
2023-06-18 07:02:22,519 INFO [train.py:996] (0/4) Epoch 1, batch 20000, loss[loss=0.3323, simple_loss=0.3986, pruned_loss=0.1329, over 21593.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3621, pruned_loss=0.1275, over 4251801.07 frames. ], batch size: 389, lr: 2.67e-02, grad_scale: 32.0
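
The checkpoint above is written at batch 20000, i.e. on a save-every-N-batches schedule in addition to the usual per-epoch saves. A minimal sketch of that logic; the checkpoint contents here are assumptions, and the icefall helpers also persist state such as the sampler and the grad scaler:

    from pathlib import Path
    import torch

    def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                              exp_dir: Path, save_every_n: int) -> None:
        # Save a numbered checkpoint every save_every_n training batches.
        if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
            return
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            exp_dir / f"checkpoint-{batch_idx_train}.pt",
        )
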
2023-06-18 07:03:19,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=120060.0, ans=0.1
2023-06-18 07:03:43,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=120180.0, ans=0.125
2023-06-18 07:03:43,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=120180.0, ans=0.125
2023-06-18 07:03:43,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.407e+02 3.393e+02 3.852e+02 4.728e+02 8.512e+02, threshold=7.705e+02, percent-clipped=1.0
2023-06-18 07:04:27,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=120240.0, ans=0.0
2023-06-18 07:04:40,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=120240.0, ans=0.02
2023-06-18 07:04:59,168 INFO [train.py:996] (0/4) Epoch 1, batch 20050, loss[loss=0.3231, simple_loss=0.3761, pruned_loss=0.1351, over 21840.00 frames. ], tot_loss[loss=0.3142, simple_loss=0.3656, pruned_loss=0.1314, over 4266784.49 frames. ], batch size: 282, lr: 2.66e-02, grad_scale: 32.0
2023-06-18 07:05:44,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=120360.0, ans=0.1
2023-06-18 07:06:01,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=120420.0, ans=0.125
2023-06-18 07:06:47,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=120480.0, ans=0.035
2023-06-18 07:07:07,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=120540.0, ans=0.125
2023-06-18 07:07:39,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=120540.0, ans=0.125
2023-06-18 07:07:43,289 INFO [train.py:996] (0/4) Epoch 1, batch 20100, loss[loss=0.3276, simple_loss=0.3795, pruned_loss=0.1379, over 21356.00 frames. ], tot_loss[loss=0.3203, simple_loss=0.3698, pruned_loss=0.1354, over 4275151.08 frames. ], batch size: 211, lr: 2.66e-02, grad_scale: 32.0
2023-06-18 07:07:45,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=120600.0, ans=0.5
2023-06-18 07:08:00,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0
2023-06-18 07:08:34,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=120660.0, ans=0.125
2023-06-18 07:09:37,474 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.314e+02 3.983e+02 4.752e+02 1.053e+03, threshold=7.965e+02, percent-clipped=3.0
2023-06-18 07:09:53,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=120840.0, ans=0.0
2023-06-18 07:10:01,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=120840.0, ans=0.0
2023-06-18 07:10:18,621 INFO [train.py:996] (0/4) Epoch 1, batch 20150, loss[loss=0.4343, simple_loss=0.4584, pruned_loss=0.2051, over 21778.00 frames. ], tot_loss[loss=0.3343, simple_loss=0.384, pruned_loss=0.1423, over 4272332.95 frames. ], batch size: 441, lr: 2.66e-02, grad_scale: 32.0
2023-06-18 07:12:10,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121080.0, ans=0.1
2023-06-18 07:13:15,480 INFO [train.py:996] (0/4) Epoch 1, batch 20200, loss[loss=0.3473, simple_loss=0.4337, pruned_loss=0.1305, over 20765.00 frames. ], tot_loss[loss=0.3395, simple_loss=0.3883, pruned_loss=0.1453, over 4265968.05 frames. ], batch size: 607, lr: 2.65e-02, grad_scale: 32.0
2023-06-18 07:14:34,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=121320.0, ans=0.5
2023-06-18 07:15:07,284 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 07:15:08,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 3.389e+02 4.045e+02 4.963e+02 8.967e+02, threshold=8.091e+02, percent-clipped=1.0
2023-06-18 07:15:22,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=121380.0, ans=0.2
2023-06-18 07:15:29,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=121440.0, ans=0.125
2023-06-18 07:16:01,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0
2023-06-18 07:16:06,300 INFO [train.py:996] (0/4) Epoch 1, batch 20250, loss[loss=0.3004, simple_loss=0.3659, pruned_loss=0.1174, over 21768.00 frames. ], tot_loss[loss=0.3352, simple_loss=0.3867, pruned_loss=0.1419, over 4266217.40 frames. ], batch size: 247, lr: 2.65e-02, grad_scale: 32.0
2023-06-18 07:17:06,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.73 vs. limit=6.0
2023-06-18 07:17:14,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=121620.0, ans=0.125
2023-06-18 07:17:44,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=121680.0, ans=0.0
2023-06-18 07:17:58,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=121680.0, ans=0.0
2023-06-18 07:18:32,718 INFO [train.py:996] (0/4) Epoch 1, batch 20300, loss[loss=0.3067, simple_loss=0.3747, pruned_loss=0.1194, over 21756.00 frames. ], tot_loss[loss=0.3272, simple_loss=0.3816, pruned_loss=0.1364, over 4251757.39 frames. ], batch size: 316, lr: 2.65e-02, grad_scale: 32.0
2023-06-18 07:18:33,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=121800.0, ans=0.125
2023-06-18 07:19:58,578 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.892e+02 3.299e+02 4.163e+02 6.538e+02, threshold=6.599e+02, percent-clipped=0.0
2023-06-18 07:20:43,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122040.0, ans=0.1
2023-06-18 07:20:52,730 INFO [train.py:996] (0/4) Epoch 1, batch 20350, loss[loss=0.3578, simple_loss=0.3926, pruned_loss=0.1614, over 21519.00 frames. ], tot_loss[loss=0.3274, simple_loss=0.3808, pruned_loss=0.137, over 4243816.15 frames. ], batch size: 548, lr: 2.65e-02, grad_scale: 32.0
2023-06-18 07:21:00,396 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.41 vs. limit=22.5
2023-06-18 07:23:24,535 INFO [train.py:996] (0/4) Epoch 1, batch 20400, loss[loss=0.3719, simple_loss=0.4088, pruned_loss=0.1675, over 21281.00 frames. ], tot_loss[loss=0.3339, simple_loss=0.3854, pruned_loss=0.1412, over 4241268.18 frames. ], batch size: 143, lr: 2.64e-02, grad_scale: 32.0
2023-06-18 07:23:33,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=122400.0, ans=0.0
2023-06-18 07:23:39,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=122400.0, ans=12.0
2023-06-18 07:24:14,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=122460.0, ans=0.125
2023-06-18 07:24:32,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=122520.0, ans=0.125
2023-06-18 07:24:36,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.452e+02 4.202e+02 5.218e+02 1.418e+03, threshold=8.403e+02, percent-clipped=8.0
2023-06-18 07:24:38,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. limit=6.0
2023-06-18 07:25:22,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122640.0, ans=0.1
2023-06-18 07:25:30,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0
2023-06-18 07:25:38,621 INFO [train.py:996] (0/4) Epoch 1, batch 20450, loss[loss=0.3226, simple_loss=0.3577, pruned_loss=0.1437, over 20809.00 frames. ], tot_loss[loss=0.3385, simple_loss=0.3874, pruned_loss=0.1448, over 4226657.46 frames. ], batch size: 608, lr: 2.64e-02, grad_scale: 32.0
2023-06-18 07:26:51,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=122820.0, ans=0.0
2023-06-18 07:26:53,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=122820.0, ans=0.125
2023-06-18 07:28:17,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0
2023-06-18 07:28:18,145 INFO [train.py:996] (0/4) Epoch 1, batch 20500, loss[loss=0.3298, simple_loss=0.3508, pruned_loss=0.1544, over 21406.00 frames. ], tot_loss[loss=0.3359, simple_loss=0.3826, pruned_loss=0.1446, over 4229082.17 frames. ], batch size: 548, lr: 2.64e-02, grad_scale: 32.0
2023-06-18 07:28:24,694 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=15.0
2023-06-18 07:29:07,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=123120.0, ans=0.125
2023-06-18 07:29:32,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=123120.0, ans=0.125
2023-06-18 07:29:42,660 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0
2023-06-18 07:29:54,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 3.481e+02 4.279e+02 5.718e+02 8.765e+02, threshold=8.557e+02, percent-clipped=1.0
2023-06-18 07:30:19,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=123240.0, ans=0.0
2023-06-18 07:30:36,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123240.0, ans=0.1
2023-06-18 07:30:47,879 INFO [train.py:996] (0/4) Epoch 1, batch 20550, loss[loss=0.3294, simple_loss=0.389, pruned_loss=0.1349, over 21860.00 frames. ], tot_loss[loss=0.327, simple_loss=0.3724, pruned_loss=0.1408, over 4225425.62 frames. ], batch size: 372, lr: 2.63e-02, grad_scale: 16.0
2023-06-18 07:30:48,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=123300.0, ans=0.125
2023-06-18 07:31:31,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0
2023-06-18 07:31:50,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=123420.0, ans=0.025
2023-06-18 07:33:01,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=123540.0, ans=0.1
2023-06-18 07:33:33,246 INFO [train.py:996] (0/4) Epoch 1, batch 20600, loss[loss=0.3125, simple_loss=0.3634, pruned_loss=0.1308, over 21777.00 frames. ], tot_loss[loss=0.3256, simple_loss=0.374, pruned_loss=0.1386, over 4245586.70 frames. ], batch size: 247, lr: 2.63e-02, grad_scale: 16.0
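
The grad_scale field in the train entries is the loss-scaling factor of mixed-precision (fp16) training: it is halved when scaled gradients overflow and grown again after a run of finite steps, which is why it moves between values like 16.0, 32.0 and 64.0 across this log (it drops from 32.0 to 16.0 at batch 20550 above). The standard PyTorch pattern looks roughly like this; the training script presumably adds its own bookkeeping around it:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

    def train_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = compute_loss(model, batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)   # skips the update if gradients overflowed
        scaler.update()          # halves the scale on overflow, grows it later
        return loss.detach(), scaler.get_scale()
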
2023-06-18 07:34:14,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=123660.0, ans=0.125
2023-06-18 07:35:13,353 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 2.900e+02 3.607e+02 4.743e+02 9.151e+02, threshold=7.215e+02, percent-clipped=2.0
2023-06-18 07:35:23,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=123780.0, ans=0.125
2023-06-18 07:36:08,484 INFO [train.py:996] (0/4) Epoch 1, batch 20650, loss[loss=0.3219, simple_loss=0.3556, pruned_loss=0.1441, over 21621.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3701, pruned_loss=0.139, over 4240900.96 frames. ], batch size: 414, lr: 2.63e-02, grad_scale: 16.0
2023-06-18 07:36:10,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123900.0, ans=0.1
2023-06-18 07:36:35,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=123900.0, ans=0.0
2023-06-18 07:37:37,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.12 vs. limit=22.5
2023-06-18 07:37:43,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=124080.0, ans=0.1
2023-06-18 07:37:56,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0
2023-06-18 07:38:28,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124140.0, ans=0.1
2023-06-18 07:38:40,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0
2023-06-18 07:38:46,610 INFO [train.py:996] (0/4) Epoch 1, batch 20700, loss[loss=0.3096, simple_loss=0.348, pruned_loss=0.1356, over 21533.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3619, pruned_loss=0.1335, over 4249504.95 frames. ], batch size: 441, lr: 2.63e-02, grad_scale: 16.0
2023-06-18 07:38:46,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124200.0, ans=0.1
2023-06-18 07:39:52,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=124320.0, ans=0.125
2023-06-18 07:40:16,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.474e+02 3.916e+02 5.055e+02 8.009e+02, threshold=7.832e+02, percent-clipped=2.0
2023-06-18 07:40:50,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=124440.0, ans=10.0
2023-06-18 07:41:15,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=124440.0, ans=0.125
2023-06-18 07:41:27,302 INFO [train.py:996] (0/4) Epoch 1, batch 20750, loss[loss=0.4171, simple_loss=0.4825, pruned_loss=0.1758, over 21673.00 frames. ], tot_loss[loss=0.3122, simple_loss=0.3629, pruned_loss=0.1308, over 4249873.16 frames. ], batch size: 414, lr: 2.62e-02, grad_scale: 16.0
2023-06-18 07:42:30,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124620.0, ans=0.1
2023-06-18 07:43:08,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=124680.0, ans=0.125
2023-06-18 07:44:05,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0
2023-06-18 07:44:11,471 INFO [train.py:996] (0/4) Epoch 1, batch 20800, loss[loss=0.3159, simple_loss=0.3602, pruned_loss=0.1358, over 21797.00 frames. ], tot_loss[loss=0.3185, simple_loss=0.3692, pruned_loss=0.1339, over 4252821.49 frames. ], batch size: 102, lr: 2.62e-02, grad_scale: 32.0
2023-06-18 07:44:18,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0
2023-06-18 07:44:30,038 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 07:44:37,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=124860.0, ans=15.0
2023-06-18 07:44:38,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124860.0, ans=0.1
2023-06-18 07:45:03,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=124920.0, ans=0.0
2023-06-18 07:45:37,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.418e+02 4.058e+02 5.141e+02 9.579e+02, threshold=8.117e+02, percent-clipped=5.0
2023-06-18 07:45:53,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=124980.0, ans=0.0
2023-06-18 07:46:20,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.44 vs. limit=22.5
2023-06-18 07:46:24,886 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0
2023-06-18 07:46:32,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=125040.0, ans=0.0
2023-06-18 07:46:35,473 INFO [train.py:996] (0/4) Epoch 1, batch 20850, loss[loss=0.302, simple_loss=0.3477, pruned_loss=0.1282, over 21394.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.359, pruned_loss=0.13, over 4251829.56 frames. ], batch size: 131, lr: 2.62e-02, grad_scale: 32.0
2023-06-18 07:46:43,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=125100.0, ans=0.0
2023-06-18 07:46:43,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0
2023-06-18 07:47:28,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=125160.0, ans=0.125
2023-06-18 07:47:47,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=125220.0, ans=0.125
2023-06-18 07:49:08,586 INFO [train.py:996] (0/4) Epoch 1, batch 20900, loss[loss=0.3841, simple_loss=0.4133, pruned_loss=0.1774, over 21700.00 frames. ], tot_loss[loss=0.3149, simple_loss=0.363, pruned_loss=0.1334, over 4262755.49 frames. ], batch size: 473, lr: 2.62e-02, grad_scale: 32.0
2023-06-18 07:50:24,785 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.865e+02 3.704e+02 4.826e+02 8.208e+02, threshold=7.408e+02, percent-clipped=1.0
2023-06-18 07:50:38,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=125640.0, ans=0.125
2023-06-18 07:51:11,909 INFO [train.py:996] (0/4) Epoch 1, batch 20950, loss[loss=0.2377, simple_loss=0.3007, pruned_loss=0.08736, over 21600.00 frames. ], tot_loss[loss=0.3047, simple_loss=0.3562, pruned_loss=0.1266, over 4271606.05 frames. ], batch size: 263, lr: 2.61e-02, grad_scale: 32.0
2023-06-18 07:51:15,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=125700.0, ans=0.125
2023-06-18 07:51:57,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0
2023-06-18 07:53:02,756 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0
2023-06-18 07:53:15,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=125940.0, ans=15.0
2023-06-18 07:53:19,310 INFO [train.py:996] (0/4) Epoch 1, batch 21000, loss[loss=0.3756, simple_loss=0.3984, pruned_loss=0.1764, over 21804.00 frames. ], tot_loss[loss=0.3071, simple_loss=0.3574, pruned_loss=0.1284, over 4267294.32 frames. ], batch size: 441, lr: 2.61e-02, grad_scale: 32.0
2023-06-18 07:53:19,311 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-18 07:54:12,192 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.8034, 3.4731, 3.4150, 3.2448], device='cuda:0')
2023-06-18 07:54:13,248 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3148, simple_loss=0.4061, pruned_loss=0.1118, over 1796401.00 frames.
2023-06-18 07:54:13,253 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB
2023-06-18 07:54:49,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=126060.0, ans=0.125
2023-06-18 07:55:06,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. limit=6.0
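
During the validation pass above, the script also logs the entropy of selected attention-weight distributions; the printed tensor holds one value per attention head (four heads in that layer). Entropy here is the usual -sum(p * log p) over the attention axis; a sketch, assuming weights that already sum to one over the last dimension:

    import torch

    def attn_weights_entropy(attn: torch.Tensor, eps: float = 1e-20) -> torch.Tensor:
        # attn: (num_heads, tgt_len, src_len) with rows summing to 1.
        # Returns the mean entropy per head, shape (num_heads,).
        ent = -(attn * (attn + eps).log()).sum(dim=-1)
        return ent.mean(dim=-1)

    attn = torch.softmax(torch.randn(4, 50, 50), dim=-1)
    print(attn_weights_entropy(attn))   # one entropy value per head

A markedly low value (like the 1.80 above next to three values around 3.3) indicates a head that attends much more sharply than its peers.
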
2023-06-18 07:55:12,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=126180.0, ans=0.0
2023-06-18 07:55:15,210 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.966e+02 3.592e+02 4.576e+02 7.047e+02, threshold=7.185e+02, percent-clipped=0.0
2023-06-18 07:55:30,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=126180.0, ans=0.1
2023-06-18 07:56:15,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=126240.0, ans=0.95
2023-06-18 07:56:21,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.29 vs. limit=15.0
2023-06-18 07:56:21,893 INFO [train.py:996] (0/4) Epoch 1, batch 21050, loss[loss=0.2225, simple_loss=0.2853, pruned_loss=0.07982, over 15946.00 frames. ], tot_loss[loss=0.3063, simple_loss=0.3555, pruned_loss=0.1286, over 4259300.61 frames. ], batch size: 61, lr: 2.61e-02, grad_scale: 32.0
2023-06-18 07:56:24,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0
2023-06-18 07:57:12,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=126420.0, ans=0.0
2023-06-18 07:58:25,920 INFO [train.py:996] (0/4) Epoch 1, batch 21100, loss[loss=0.2631, simple_loss=0.3211, pruned_loss=0.1026, over 21502.00 frames. ], tot_loss[loss=0.3027, simple_loss=0.3508, pruned_loss=0.1273, over 4254471.61 frames. ], batch size: 230, lr: 2.60e-02, grad_scale: 32.0
2023-06-18 07:58:27,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=126600.0, ans=0.09899494936611666
2023-06-18 07:58:45,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=126600.0, ans=0.025
2023-06-18 07:59:11,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=126660.0, ans=0.125
2023-06-18 07:59:12,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=126660.0, ans=0.0
2023-06-18 07:59:50,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 3.220e+02 3.817e+02 4.701e+02 7.563e+02, threshold=7.635e+02, percent-clipped=1.0
2023-06-18 08:00:43,854 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0
2023-06-18 08:00:50,249 INFO [train.py:996] (0/4) Epoch 1, batch 21150, loss[loss=0.2922, simple_loss=0.3268, pruned_loss=0.1288, over 16432.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3459, pruned_loss=0.1271, over 4244974.25 frames. ], batch size: 66, lr: 2.60e-02, grad_scale: 32.0
2023-06-18 08:01:36,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=126960.0, ans=0.035
2023-06-18 08:01:37,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=126960.0, ans=0.0
2023-06-18 08:03:01,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=127200.0, ans=0.0
2023-06-18 08:03:11,673 INFO [train.py:996] (0/4) Epoch 1, batch 21200, loss[loss=0.2792, simple_loss=0.3164, pruned_loss=0.121, over 21260.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3428, pruned_loss=0.1259, over 4238473.02 frames. ], batch size: 159, lr: 2.60e-02, grad_scale: 32.0
2023-06-18 08:03:42,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127260.0, ans=0.1
2023-06-18 08:04:35,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 3.266e+02 3.911e+02 4.430e+02 6.717e+02, threshold=7.823e+02, percent-clipped=0.0
2023-06-18 08:05:11,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=127440.0, ans=0.0
2023-06-18 08:05:57,565 INFO [train.py:996] (0/4) Epoch 1, batch 21250, loss[loss=0.3501, simple_loss=0.3946, pruned_loss=0.1528, over 21553.00 frames. ], tot_loss[loss=0.298, simple_loss=0.3424, pruned_loss=0.1268, over 4236649.26 frames. ], batch size: 389, lr: 2.60e-02, grad_scale: 32.0
2023-06-18 08:05:59,672 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 08:06:34,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=127560.0, ans=0.125
2023-06-18 08:07:45,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=127680.0, ans=0.05
2023-06-18 08:07:58,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127740.0, ans=0.1
2023-06-18 08:08:27,259 INFO [train.py:996] (0/4) Epoch 1, batch 21300, loss[loss=0.3515, simple_loss=0.4007, pruned_loss=0.1512, over 21894.00 frames. ], tot_loss[loss=0.3044, simple_loss=0.3498, pruned_loss=0.1295, over 4249359.59 frames. ], batch size: 118, lr: 2.59e-02, grad_scale: 32.0
2023-06-18 08:09:54,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=127980.0, ans=0.0
2023-06-18 08:09:55,004 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.217e+02 3.809e+02 4.662e+02 7.490e+02, threshold=7.618e+02, percent-clipped=0.0
2023-06-18 08:10:07,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0
2023-06-18 08:10:47,964 INFO [train.py:996] (0/4) Epoch 1, batch 21350, loss[loss=0.2092, simple_loss=0.2667, pruned_loss=0.07584, over 16488.00 frames. ], tot_loss[loss=0.3082, simple_loss=0.3545, pruned_loss=0.131, over 4251336.48 frames. ], batch size: 62, lr: 2.59e-02, grad_scale: 32.0
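
The learning rate in these entries decays smoothly with training progress (from 2.72e-02 near batch 19100 down to 2.59e-02 here), consistent with a schedule that shrinks gradually in both batch count and epoch rather than in discrete steps. A hedged sketch of an Eden-style schedule of the kind the icefall optimizer provides; the exponent, the lr_batches/lr_epochs constants, and especially how this particular run counts batches and epochs are assumptions, so plugging in numbers from this log will not reproduce the logged values exactly:

    def eden_lr(base_lr: float, batch: float, epoch: float,
                lr_batches: float, lr_epochs: float) -> float:
        # Two smoothly decaying factors, one in batch count and one in epoch
        # count; lr_batches / lr_epochs set where each decay effectively begins.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor
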
2023-06-18 08:12:05,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=128220.0, ans=0.2
2023-06-18 08:13:16,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=128340.0, ans=0.125
2023-06-18 08:13:22,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=128340.0, ans=0.125
2023-06-18 08:13:22,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=128340.0, ans=0.0
2023-06-18 08:13:32,907 INFO [train.py:996] (0/4) Epoch 1, batch 21400, loss[loss=0.3498, simple_loss=0.3916, pruned_loss=0.154, over 20599.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.358, pruned_loss=0.1299, over 4259626.59 frames. ], batch size: 607, lr: 2.59e-02, grad_scale: 32.0
2023-06-18 08:13:42,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128400.0, ans=0.1
2023-06-18 08:14:03,271 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 08:15:26,239 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.956e+02 3.569e+02 4.352e+02 6.315e+02, threshold=7.139e+02, percent-clipped=0.0
2023-06-18 08:15:26,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=128580.0, ans=0.125
2023-06-18 08:15:57,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=128640.0, ans=0.0
2023-06-18 08:16:25,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=128640.0, ans=0.125
2023-06-18 08:16:33,234 INFO [train.py:996] (0/4) Epoch 1, batch 21450, loss[loss=0.3641, simple_loss=0.3977, pruned_loss=0.1652, over 21783.00 frames. ], tot_loss[loss=0.314, simple_loss=0.3624, pruned_loss=0.1328, over 4271147.12 frames. ], batch size: 441, lr: 2.59e-02, grad_scale: 32.0
2023-06-18 08:16:46,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0
2023-06-18 08:17:03,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=128760.0, ans=0.125
2023-06-18 08:17:11,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=128820.0, ans=0.1
2023-06-18 08:17:15,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=128820.0, ans=0.035
2023-06-18 08:18:42,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=128940.0, ans=0.0
2023-06-18 08:18:52,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=129000.0, ans=0.125
2023-06-18 08:18:59,337 INFO [train.py:996] (0/4) Epoch 1, batch 21500, loss[loss=0.3102, simple_loss=0.353, pruned_loss=0.1337, over 21997.00 frames. ], tot_loss[loss=0.3142, simple_loss=0.3604, pruned_loss=0.1341, over 4270157.51 frames. ], batch size: 103, lr: 2.58e-02, grad_scale: 32.0
2023-06-18 08:19:07,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5
2023-06-18 08:19:29,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=129060.0, ans=0.0
2023-06-18 08:19:29,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0
2023-06-18 08:20:30,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.387e+02 3.519e+02 4.524e+02 5.451e+02 1.077e+03, threshold=9.047e+02, percent-clipped=9.0
2023-06-18 08:21:25,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0
2023-06-18 08:21:28,501 INFO [train.py:996] (0/4) Epoch 1, batch 21550, loss[loss=0.2725, simple_loss=0.3211, pruned_loss=0.1119, over 21599.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3538, pruned_loss=0.1315, over 4256913.56 frames. ], batch size: 391, lr: 2.58e-02, grad_scale: 32.0
2023-06-18 08:21:54,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=129360.0, ans=0.125
2023-06-18 08:22:40,973 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0
2023-06-18 08:23:34,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=129540.0, ans=0.2
2023-06-18 08:24:15,443 INFO [train.py:996] (0/4) Epoch 1, batch 21600, loss[loss=0.3267, simple_loss=0.4167, pruned_loss=0.1183, over 19692.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3495, pruned_loss=0.1291, over 4251560.75 frames. ], batch size: 703, lr: 2.58e-02, grad_scale: 32.0
2023-06-18 08:24:25,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=129600.0, ans=0.125
2023-06-18 08:25:43,521 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.179e+02 3.851e+02 4.633e+02 7.831e+02, threshold=7.701e+02, percent-clipped=0.0
2023-06-18 08:25:57,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=129780.0, ans=0.0
2023-06-18 08:26:40,567 INFO [train.py:996] (0/4) Epoch 1, batch 21650, loss[loss=0.3575, simple_loss=0.4252, pruned_loss=0.1449, over 21639.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3532, pruned_loss=0.126, over 4257608.29 frames. ], batch size: 441, lr: 2.57e-02, grad_scale: 32.0
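
The per-batch size in these entries swings between roughly 60 and 700 utterances (e.g. batch size 703 for batch 21600 above) because batches are assembled up to a total duration budget rather than a fixed count: long utterances fill the budget quickly, while short ones pack densely. A toy sketch of duration-capped batching; the real dynamic bucketing sampler additionally groups utterances of similar length and shuffles:

    def batches_by_duration(utts, max_duration: float):
        # utts: iterable of (utt_id, duration_seconds) pairs.
        # Yields lists of ids whose total duration stays under max_duration.
        batch, total = [], 0.0
        for utt_id, dur in utts:
            if batch and total + dur > max_duration:
                yield batch
                batch, total = [], 0.0
            batch.append(utt_id)
            total += dur
        if batch:
            yield batch

    # Short utterances yield large batches, long ones small batches:
    list(batches_by_duration([("a", 2.0)] * 10, max_duration=6.0))
    # -> [['a', 'a', 'a'], ['a', 'a', 'a'], ['a', 'a', 'a'], ['a']]
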
2023-06-18 08:27:29,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=130020.0, ans=0.125
2023-06-18 08:28:14,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=130080.0, ans=0.0
2023-06-18 08:28:14,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=130080.0, ans=0.04949747468305833
2023-06-18 08:28:27,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=130140.0, ans=0.0
2023-06-18 08:28:49,756 INFO [train.py:996] (0/4) Epoch 1, batch 21700, loss[loss=0.2674, simple_loss=0.3191, pruned_loss=0.1078, over 21793.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3536, pruned_loss=0.1239, over 4260438.65 frames. ], batch size: 118, lr: 2.57e-02, grad_scale: 32.0
2023-06-18 08:30:12,484 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.876e+02 3.492e+02 4.292e+02 6.287e+02, threshold=6.984e+02, percent-clipped=0.0
2023-06-18 08:30:26,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=130380.0, ans=0.125
2023-06-18 08:31:03,498 INFO [train.py:996] (0/4) Epoch 1, batch 21750, loss[loss=0.2769, simple_loss=0.3267, pruned_loss=0.1136, over 21737.00 frames. ], tot_loss[loss=0.3006, simple_loss=0.3502, pruned_loss=0.1254, over 4243549.45 frames. ], batch size: 124, lr: 2.57e-02, grad_scale: 32.0
2023-06-18 08:31:03,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=130500.0, ans=0.09899494936611666
2023-06-18 08:32:04,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=130620.0, ans=0.04949747468305833
2023-06-18 08:32:28,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=130680.0, ans=0.125
2023-06-18 08:33:24,960 INFO [train.py:996] (0/4) Epoch 1, batch 21800, loss[loss=0.3186, simple_loss=0.3722, pruned_loss=0.1325, over 21654.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3479, pruned_loss=0.1255, over 4240443.21 frames. ], batch size: 298, lr: 2.57e-02, grad_scale: 32.0
2023-06-18 08:33:50,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=130800.0, ans=0.05
2023-06-18 08:33:53,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=130860.0, ans=0.0
2023-06-18 08:34:26,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=130860.0, ans=0.125
2023-06-18 08:35:17,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 3.102e+02 3.611e+02 4.260e+02 6.998e+02, threshold=7.223e+02, percent-clipped=1.0
2023-06-18 08:36:06,416 INFO [train.py:996] (0/4) Epoch 1, batch 21850, loss[loss=0.3889, simple_loss=0.4494, pruned_loss=0.1642, over 19749.00 frames. ], tot_loss[loss=0.3029, simple_loss=0.3528, pruned_loss=0.1265, over 4242531.23 frames. ], batch size: 702, lr: 2.56e-02, grad_scale: 32.0
2023-06-18 08:36:18,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=131100.0, ans=0.125
2023-06-18 08:38:50,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=131340.0, ans=0.1
2023-06-18 08:38:54,592 INFO [train.py:996] (0/4) Epoch 1, batch 21900, loss[loss=0.3502, simple_loss=0.4094, pruned_loss=0.1455, over 21564.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3548, pruned_loss=0.128, over 4256128.16 frames. ], batch size: 471, lr: 2.56e-02, grad_scale: 32.0
2023-06-18 08:39:56,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=131520.0, ans=0.125
2023-06-18 08:40:06,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=131520.0, ans=0.125
2023-06-18 08:40:14,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.557e+02 4.079e+02 5.086e+02 8.901e+02, threshold=8.158e+02, percent-clipped=3.0
2023-06-18 08:40:15,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=131580.0, ans=0.125
2023-06-18 08:40:37,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=131640.0, ans=0.125
2023-06-18 08:40:42,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=131640.0, ans=0.125
2023-06-18 08:40:45,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=131640.0, ans=0.125
2023-06-18 08:40:59,001 INFO [train.py:996] (0/4) Epoch 1, batch 21950, loss[loss=0.2271, simple_loss=0.2982, pruned_loss=0.07801, over 21685.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3485, pruned_loss=0.1265, over 4252861.62 frames. ], batch size: 316, lr: 2.56e-02, grad_scale: 32.0
2023-06-18 08:41:13,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=131700.0, ans=0.125
2023-06-18 08:41:50,694 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=15.0
2023-06-18 08:42:35,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=131880.0, ans=0.02
2023-06-18 08:42:41,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=131880.0, ans=0.2
2023-06-18 08:42:59,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=131940.0, ans=0.2
2023-06-18 08:43:15,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=131940.0, ans=0.0
2023-06-18 08:43:18,610 INFO [train.py:996] (0/4) Epoch 1, batch 22000, loss[loss=0.2665, simple_loss=0.3176, pruned_loss=0.1077, over 21159.00 frames. ], tot_loss[loss=0.2922, simple_loss=0.3401, pruned_loss=0.1221, over 4251626.70 frames. ], batch size: 159, lr: 2.56e-02, grad_scale: 32.0
2023-06-18 08:43:37,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=132000.0, ans=0.125
2023-06-18 08:45:03,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 3.001e+02 3.656e+02 5.158e+02 1.119e+03, threshold=7.313e+02, percent-clipped=4.0
2023-06-18 08:45:35,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=132180.0, ans=0.0
2023-06-18 08:46:05,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=132240.0, ans=15.0
2023-06-18 08:46:09,169 INFO [train.py:996] (0/4) Epoch 1, batch 22050, loss[loss=0.3866, simple_loss=0.4388, pruned_loss=0.1672, over 21294.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3456, pruned_loss=0.1241, over 4250883.75 frames. ], batch size: 549, lr: 2.55e-02, grad_scale: 32.0
2023-06-18 08:46:37,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=132300.0, ans=0.025
2023-06-18 08:46:37,680 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0
2023-06-18 08:47:35,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=132480.0, ans=0.0
2023-06-18 08:47:47,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132480.0, ans=0.1
2023-06-18 08:47:56,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=132480.0, ans=0.125
2023-06-18 08:48:43,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132540.0, ans=0.1
2023-06-18 08:48:48,912 INFO [train.py:996] (0/4) Epoch 1, batch 22100, loss[loss=0.3411, simple_loss=0.3827, pruned_loss=0.1497, over 21281.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.3587, pruned_loss=0.1316, over 4261663.09 frames. ], batch size: 143, lr: 2.55e-02, grad_scale: 32.0
2023-06-18 08:48:59,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=132600.0, ans=0.125
2023-06-18 08:49:38,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132660.0, ans=0.1
2023-06-18 08:49:38,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132660.0, ans=0.1
2023-06-18 08:50:17,072 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.342e+02 4.040e+02 5.215e+02 7.946e+02, threshold=8.079e+02, percent-clipped=2.0
2023-06-18 08:51:25,518 INFO [train.py:996] (0/4) Epoch 1, batch 22150, loss[loss=0.3275, simple_loss=0.3741, pruned_loss=0.1404, over 21742.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3617, pruned_loss=0.1327, over 4270133.22 frames. ], batch size: 389, lr: 2.55e-02, grad_scale: 32.0
2023-06-18 08:51:38,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.64 vs. limit=15.0
limit=15.0 2023-06-18 08:51:39,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=132900.0, ans=0.125 2023-06-18 08:51:47,582 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-18 08:51:47,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.80 vs. limit=15.0 2023-06-18 08:52:20,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-18 08:52:36,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.24 vs. limit=15.0 2023-06-18 08:53:24,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133080.0, ans=0.1 2023-06-18 08:53:35,305 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-18 08:53:44,053 INFO [train.py:996] (0/4) Epoch 1, batch 22200, loss[loss=0.3886, simple_loss=0.4751, pruned_loss=0.1511, over 19647.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.3651, pruned_loss=0.1349, over 4276729.99 frames. ], batch size: 702, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 08:54:53,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133260.0, ans=0.1 2023-06-18 08:54:55,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=133260.0, ans=0.125 2023-06-18 08:55:03,363 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-18 08:55:04,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=133320.0, ans=0.2 2023-06-18 08:55:43,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.063e+02 3.788e+02 4.693e+02 1.107e+03, threshold=7.575e+02, percent-clipped=1.0 2023-06-18 08:55:46,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.84 vs. limit=10.0 2023-06-18 08:55:48,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=133380.0, ans=0.04949747468305833 2023-06-18 08:56:47,420 INFO [train.py:996] (0/4) Epoch 1, batch 22250, loss[loss=0.2992, simple_loss=0.3406, pruned_loss=0.1288, over 21205.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3724, pruned_loss=0.1368, over 4272144.88 frames. 
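The [scaling.py:962] Whitening records above compare a per-module statistic of the activations against a limit: the metric is near its floor when the channel covariance is close to isotropic ("white") and grows as a few directions dominate. The exact statistic lives in icefall's scaling.py; the version below is a plausible stand-in assumed for illustration (not the recipe's actual formula) that equals 1.0 for perfectly white features and increases with anisotropy:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations from one module
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]         # (C, C) channel covariance
    eigs = torch.linalg.eigvalsh(cov)    # real eigenvalues, ascending
    c = x.shape[1]
    # equals 1.0 when all eigenvalues are equal; larger means less white
    return float(c * (eigs ** 2).sum() / eigs.sum() ** 2)

x = torch.randn(1000, 256) * torch.linspace(0.1, 3.0, 256)  # anisotropic toy data
print(f"metric={whitening_metric(x):.2f} vs. limit=15.0")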
], batch size: 608, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 08:56:49,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133500.0, ans=0.1 2023-06-18 08:56:50,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=133500.0, ans=0.2 2023-06-18 08:56:52,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=133500.0, ans=0.125 2023-06-18 08:56:59,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=133500.0, ans=0.0 2023-06-18 08:58:26,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=133680.0, ans=0.125 2023-06-18 08:58:32,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=133680.0, ans=0.0 2023-06-18 08:58:55,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133740.0, ans=0.1 2023-06-18 08:59:08,862 INFO [train.py:996] (0/4) Epoch 1, batch 22300, loss[loss=0.4111, simple_loss=0.4246, pruned_loss=0.1988, over 21639.00 frames. ], tot_loss[loss=0.3291, simple_loss=0.3764, pruned_loss=0.1409, over 4276611.27 frames. ], batch size: 473, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 08:59:51,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=133860.0, ans=0.125 2023-06-18 09:00:23,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=133860.0, ans=0.2 2023-06-18 09:00:53,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.587e+02 3.379e+02 4.077e+02 4.905e+02 1.025e+03, threshold=8.153e+02, percent-clipped=2.0 2023-06-18 09:01:32,671 INFO [train.py:996] (0/4) Epoch 1, batch 22350, loss[loss=0.2969, simple_loss=0.3366, pruned_loss=0.1285, over 21670.00 frames. ], tot_loss[loss=0.3282, simple_loss=0.3744, pruned_loss=0.141, over 4279681.69 frames. ], batch size: 263, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 09:01:48,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=134100.0, ans=0.125 2023-06-18 09:03:02,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=134220.0, ans=0.125 2023-06-18 09:03:02,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=134220.0, ans=0.125 2023-06-18 09:03:13,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=134220.0, ans=0.125 2023-06-18 09:04:12,522 INFO [train.py:996] (0/4) Epoch 1, batch 22400, loss[loss=0.2806, simple_loss=0.328, pruned_loss=0.1166, over 21465.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.3691, pruned_loss=0.1358, over 4286098.14 frames. 
], batch size: 212, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 09:04:20,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=134400.0, ans=0.0 2023-06-18 09:04:21,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=134400.0, ans=0.125 2023-06-18 09:05:00,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=134460.0, ans=0.125 2023-06-18 09:05:34,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=134520.0, ans=0.04949747468305833 2023-06-18 09:05:44,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 3.212e+02 4.054e+02 4.635e+02 7.635e+02, threshold=8.108e+02, percent-clipped=0.0 2023-06-18 09:06:46,093 INFO [train.py:996] (0/4) Epoch 1, batch 22450, loss[loss=0.2894, simple_loss=0.3321, pruned_loss=0.1233, over 21895.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3612, pruned_loss=0.1338, over 4279253.80 frames. ], batch size: 373, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 09:06:57,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=134700.0, ans=0.125 2023-06-18 09:06:57,878 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-18 09:07:36,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=134760.0, ans=0.125 2023-06-18 09:09:37,349 INFO [train.py:996] (0/4) Epoch 1, batch 22500, loss[loss=0.3622, simple_loss=0.4237, pruned_loss=0.1503, over 21214.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3547, pruned_loss=0.1327, over 4273538.13 frames. ], batch size: 549, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 09:09:38,582 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2023-06-18 09:09:56,866 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:10:18,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=135000.0, ans=0.0 2023-06-18 09:10:35,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=135060.0, ans=0.125 2023-06-18 09:10:41,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=135120.0, ans=0.2 2023-06-18 09:10:59,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=135180.0, ans=0.1 2023-06-18 09:11:15,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.294e+02 4.262e+02 5.582e+02 1.060e+03, threshold=8.524e+02, percent-clipped=5.0 2023-06-18 09:12:15,200 INFO [train.py:996] (0/4) Epoch 1, batch 22550, loss[loss=0.3606, simple_loss=0.3906, pruned_loss=0.1653, over 21785.00 frames. ], tot_loss[loss=0.3129, simple_loss=0.3597, pruned_loss=0.133, over 4277886.61 frames. 
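The [optim.py:471] lines summarize gradient-norm statistics each time clipping runs: the five numbers are the min/25%/50%/75%/max of recent gradient norms, the reported threshold is Clipping_scale times the median (in the record above, 2.0 * 4.054e+02 = 8.108e+02), and percent-clipped is the share of recent steps that exceeded it. A self-contained sketch of that moving-median clipping scheme (illustrative only, not icefall's ScaledAdam optimizer; the window size is arbitrary):

from collections import deque
import torch

class QuartileClipperSketch:
    def __init__(self, clipping_scale=2.0, window=128):
        self.scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent gradient norms

    def clip_(self, params):
        params = [p for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
        self.norms.append(float(norm))
        hist = torch.tensor(list(self.norms))
        threshold = self.scale * hist.median()
        if norm > threshold:               # rescale every gradient in place
            for p in params:
                p.grad.mul_(threshold / norm)
        q = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        print("grad-norm quartiles", " ".join(f"{v:.3e}" for v in q.tolist()),
              f"threshold={float(threshold):.3e}")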
], batch size: 441, lr: 2.53e-02, grad_scale: 64.0 2023-06-18 09:13:01,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=135360.0, ans=0.125 2023-06-18 09:13:50,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135480.0, ans=0.1 2023-06-18 09:14:46,448 INFO [train.py:996] (0/4) Epoch 1, batch 22600, loss[loss=0.2682, simple_loss=0.3108, pruned_loss=0.1128, over 21193.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3631, pruned_loss=0.1337, over 4284765.42 frames. ], batch size: 159, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 09:15:09,955 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.04 vs. limit=6.0 2023-06-18 09:15:39,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135660.0, ans=0.1 2023-06-18 09:15:47,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=135720.0, ans=0.125 2023-06-18 09:16:04,382 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.237e+02 4.063e+02 5.145e+02 1.049e+03, threshold=8.126e+02, percent-clipped=2.0 2023-06-18 09:16:49,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=135840.0, ans=0.125 2023-06-18 09:17:09,568 INFO [train.py:996] (0/4) Epoch 1, batch 22650, loss[loss=0.2805, simple_loss=0.3203, pruned_loss=0.1203, over 21113.00 frames. ], tot_loss[loss=0.3116, simple_loss=0.3593, pruned_loss=0.1319, over 4274870.06 frames. ], batch size: 176, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:17:12,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=135900.0, ans=0.0 2023-06-18 09:17:48,993 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:19:02,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=136140.0, ans=0.0 2023-06-18 09:19:33,367 INFO [train.py:996] (0/4) Epoch 1, batch 22700, loss[loss=0.2881, simple_loss=0.348, pruned_loss=0.1141, over 20687.00 frames. ], tot_loss[loss=0.308, simple_loss=0.3534, pruned_loss=0.1313, over 4278372.14 frames. ], batch size: 607, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:19:33,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=136200.0, ans=0.2 2023-06-18 09:19:45,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=136200.0, ans=0.125 2023-06-18 09:19:58,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=136260.0, ans=0.025 2023-06-18 09:20:03,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.80 vs. 
limit=15.0 2023-06-18 09:20:17,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=136260.0, ans=0.0 2023-06-18 09:20:42,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=22.5 2023-06-18 09:21:14,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.452e+02 4.145e+02 4.705e+02 7.533e+02, threshold=8.290e+02, percent-clipped=0.0 2023-06-18 09:21:36,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=136440.0, ans=0.0 2023-06-18 09:22:04,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=136440.0, ans=0.125 2023-06-18 09:22:06,624 INFO [train.py:996] (0/4) Epoch 1, batch 22750, loss[loss=0.3614, simple_loss=0.3994, pruned_loss=0.1617, over 21248.00 frames. ], tot_loss[loss=0.3124, simple_loss=0.3555, pruned_loss=0.1346, over 4282330.24 frames. ], batch size: 549, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:22:25,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=136500.0, ans=0.0 2023-06-18 09:23:36,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-18 09:23:38,359 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:23:38,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=22.5 2023-06-18 09:23:41,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=136680.0, ans=0.1 2023-06-18 09:23:48,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=136680.0, ans=0.05 2023-06-18 09:23:54,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136680.0, ans=0.1 2023-06-18 09:24:21,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=136740.0, ans=0.125 2023-06-18 09:24:27,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=136740.0, ans=0.125 2023-06-18 09:24:39,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.30 vs. limit=22.5 2023-06-18 09:24:40,983 INFO [train.py:996] (0/4) Epoch 1, batch 22800, loss[loss=0.3347, simple_loss=0.3888, pruned_loss=0.1403, over 21851.00 frames. ], tot_loss[loss=0.319, simple_loss=0.3619, pruned_loss=0.138, over 4285532.83 frames. ], batch size: 118, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 09:26:06,751 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.321e+02 3.270e+02 3.833e+02 4.673e+02 9.508e+02, threshold=7.666e+02, percent-clipped=3.0 2023-06-18 09:26:10,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=16.16 vs. 
limit=15.0 2023-06-18 09:27:12,212 INFO [train.py:996] (0/4) Epoch 1, batch 22850, loss[loss=0.3019, simple_loss=0.3392, pruned_loss=0.1323, over 22041.00 frames. ], tot_loss[loss=0.3173, simple_loss=0.3602, pruned_loss=0.1372, over 4285878.94 frames. ], batch size: 103, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:28:11,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=137220.0, ans=0.125 2023-06-18 09:29:00,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=137280.0, ans=0.05 2023-06-18 09:29:42,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=137340.0, ans=0.0 2023-06-18 09:29:46,660 INFO [train.py:996] (0/4) Epoch 1, batch 22900, loss[loss=0.3974, simple_loss=0.475, pruned_loss=0.1599, over 21460.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.3601, pruned_loss=0.1357, over 4279119.47 frames. ], batch size: 471, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:30:04,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=137400.0, ans=0.125 2023-06-18 09:30:22,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=137460.0, ans=0.5 2023-06-18 09:31:07,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=137520.0, ans=0.2 2023-06-18 09:31:37,296 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.304e+02 3.880e+02 4.906e+02 7.646e+02, threshold=7.759e+02, percent-clipped=0.0 2023-06-18 09:31:46,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-18 09:31:53,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=137580.0, ans=0.125 2023-06-18 09:32:27,089 INFO [train.py:996] (0/4) Epoch 1, batch 22950, loss[loss=0.2957, simple_loss=0.4159, pruned_loss=0.08779, over 20752.00 frames. ], tot_loss[loss=0.3206, simple_loss=0.3733, pruned_loss=0.134, over 4282594.21 frames. ], batch size: 607, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:32:27,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137700.0, ans=0.1 2023-06-18 09:33:40,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=137820.0, ans=0.0 2023-06-18 09:34:00,161 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2023-06-18 09:34:01,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0 2023-06-18 09:35:10,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=15.0 2023-06-18 09:35:10,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=12.0 2023-06-18 09:35:12,971 INFO [train.py:996] (0/4) Epoch 1, batch 23000, loss[loss=0.3738, simple_loss=0.3997, pruned_loss=0.1739, over 21625.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3734, pruned_loss=0.1298, over 4272855.68 frames. ], batch size: 471, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 09:35:40,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=138000.0, ans=0.0 2023-06-18 09:36:53,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.985e+02 3.505e+02 4.304e+02 7.318e+02, threshold=7.010e+02, percent-clipped=0.0 2023-06-18 09:37:06,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=138180.0, ans=0.125 2023-06-18 09:38:12,249 INFO [train.py:996] (0/4) Epoch 1, batch 23050, loss[loss=0.3519, simple_loss=0.3985, pruned_loss=0.1526, over 21709.00 frames. ], tot_loss[loss=0.3217, simple_loss=0.3763, pruned_loss=0.1336, over 4274490.88 frames. ], batch size: 351, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:38:28,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138300.0, ans=0.1 2023-06-18 09:40:27,162 INFO [train.py:996] (0/4) Epoch 1, batch 23100, loss[loss=0.2729, simple_loss=0.3204, pruned_loss=0.1128, over 21893.00 frames. ], tot_loss[loss=0.3201, simple_loss=0.3715, pruned_loss=0.1344, over 4280890.04 frames. ], batch size: 373, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:40:27,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138600.0, ans=0.1 2023-06-18 09:41:04,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=138660.0, ans=0.125 2023-06-18 09:41:57,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=138720.0, ans=0.0 2023-06-18 09:42:11,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.111e+02 3.602e+02 4.415e+02 8.420e+02, threshold=7.204e+02, percent-clipped=6.0 2023-06-18 09:42:41,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=138840.0, ans=0.125 2023-06-18 09:43:11,445 INFO [train.py:996] (0/4) Epoch 1, batch 23150, loss[loss=0.2599, simple_loss=0.291, pruned_loss=0.1144, over 20767.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3627, pruned_loss=0.132, over 4284907.79 frames. ], batch size: 609, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:43:20,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=138900.0, ans=0.0 2023-06-18 09:44:02,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=139020.0, ans=0.125 2023-06-18 09:45:38,685 INFO [train.py:996] (0/4) Epoch 1, batch 23200, loss[loss=0.3339, simple_loss=0.3718, pruned_loss=0.148, over 21859.00 frames. ], tot_loss[loss=0.3121, simple_loss=0.3603, pruned_loss=0.1319, over 4283178.09 frames. 
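The grad_scale field in each batch summary behaves like a mixed-precision loss scale: a power of two that grows after long stretches of finite gradients and is halved when a step overflows (around batches 22550-22600 above it stepped from 32.0 up to 64.0 and back). A minimal fp16 training step with that machinery via the torch.cuda.amp API; the model, sizes, and learning rate here are arbitrary placeholders, not the recipe's:

import torch

model = torch.nn.Linear(256, 1000).cuda()   # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

def train_step(feats, targets):
    opt.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(feats), targets)
    scaler.scale(loss).backward()   # backward through the scaled loss
    scaler.step(opt)                # unscales grads; skips the step on inf/nan
    scaler.update()                 # grows the scale, or halves it on overflow
    return loss.detach(), scaler.get_scale()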
], batch size: 391, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 09:46:13,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=139260.0, ans=0.05 2023-06-18 09:47:28,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.236e+02 3.747e+02 4.592e+02 8.495e+02, threshold=7.495e+02, percent-clipped=1.0 2023-06-18 09:47:30,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139380.0, ans=0.1 2023-06-18 09:47:35,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=139380.0, ans=0.95 2023-06-18 09:48:05,952 INFO [train.py:996] (0/4) Epoch 1, batch 23250, loss[loss=0.328, simple_loss=0.3715, pruned_loss=0.1423, over 21902.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.3614, pruned_loss=0.1336, over 4291011.18 frames. ], batch size: 333, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:49:26,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=139620.0, ans=0.07 2023-06-18 09:50:36,779 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-18 09:50:40,071 INFO [train.py:996] (0/4) Epoch 1, batch 23300, loss[loss=0.3211, simple_loss=0.411, pruned_loss=0.1156, over 21397.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.3708, pruned_loss=0.136, over 4288659.92 frames. ], batch size: 211, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:51:06,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=139800.0, ans=0.0 2023-06-18 09:52:00,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2023-06-18 09:52:07,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=139860.0, ans=0.0 2023-06-18 09:52:10,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139920.0, ans=0.1 2023-06-18 09:52:33,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.238e+02 4.090e+02 5.618e+02 9.282e+02, threshold=8.181e+02, percent-clipped=7.0 2023-06-18 09:52:34,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=139980.0, ans=0.0 2023-06-18 09:52:49,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=22.5 2023-06-18 09:53:26,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=140100.0, ans=0.2 2023-06-18 09:53:27,947 INFO [train.py:996] (0/4) Epoch 1, batch 23350, loss[loss=0.2366, simple_loss=0.3048, pruned_loss=0.08417, over 21608.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.3761, pruned_loss=0.1351, over 4286210.60 frames. 
], batch size: 230, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:53:28,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=140100.0, ans=0.0 2023-06-18 09:53:44,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.39 vs. limit=15.0 2023-06-18 09:53:46,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=140100.0, ans=0.5 2023-06-18 09:55:35,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=140280.0, ans=0.0 2023-06-18 09:55:53,816 INFO [train.py:996] (0/4) Epoch 1, batch 23400, loss[loss=0.2917, simple_loss=0.3493, pruned_loss=0.1171, over 21518.00 frames. ], tot_loss[loss=0.3159, simple_loss=0.3696, pruned_loss=0.1311, over 4287226.20 frames. ], batch size: 211, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 09:55:57,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=140400.0, ans=0.2 2023-06-18 09:57:08,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=140520.0, ans=0.0 2023-06-18 09:57:45,604 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.809e+02 3.311e+02 4.029e+02 8.339e+02, threshold=6.622e+02, percent-clipped=1.0 2023-06-18 09:58:27,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=140640.0, ans=0.0 2023-06-18 09:58:45,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=140640.0, ans=0.125 2023-06-18 09:58:52,847 INFO [train.py:996] (0/4) Epoch 1, batch 23450, loss[loss=0.4425, simple_loss=0.4448, pruned_loss=0.22, over 21346.00 frames. ], tot_loss[loss=0.3215, simple_loss=0.372, pruned_loss=0.1355, over 4296488.66 frames. ], batch size: 507, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 09:58:54,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=140700.0, ans=0.1 2023-06-18 09:59:08,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=140700.0, ans=0.2 2023-06-18 10:00:07,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=140820.0, ans=0.0 2023-06-18 10:00:49,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=140880.0, ans=0.125 2023-06-18 10:01:28,659 INFO [train.py:996] (0/4) Epoch 1, batch 23500, loss[loss=0.2967, simple_loss=0.3356, pruned_loss=0.1289, over 21155.00 frames. ], tot_loss[loss=0.322, simple_loss=0.3707, pruned_loss=0.1366, over 4292187.03 frames. 
], batch size: 607, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 10:02:14,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=141060.0, ans=0.0 2023-06-18 10:02:17,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=141060.0, ans=0.0 2023-06-18 10:02:34,904 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:03:03,291 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.176e+02 3.933e+02 4.936e+02 8.356e+02, threshold=7.866e+02, percent-clipped=7.0 2023-06-18 10:03:19,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141240.0, ans=0.1 2023-06-18 10:03:34,755 INFO [train.py:996] (0/4) Epoch 1, batch 23550, loss[loss=0.2885, simple_loss=0.33, pruned_loss=0.1234, over 21319.00 frames. ], tot_loss[loss=0.3181, simple_loss=0.3655, pruned_loss=0.1354, over 4285228.37 frames. ], batch size: 131, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 10:03:48,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=141300.0, ans=0.2 2023-06-18 10:03:56,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=141300.0, ans=0.125 2023-06-18 10:04:07,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=141300.0, ans=0.2 2023-06-18 10:04:55,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=12.0 2023-06-18 10:06:23,365 INFO [train.py:996] (0/4) Epoch 1, batch 23600, loss[loss=0.3565, simple_loss=0.4074, pruned_loss=0.1528, over 21237.00 frames. ], tot_loss[loss=0.3202, simple_loss=0.3676, pruned_loss=0.1364, over 4287543.71 frames. ], batch size: 143, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 10:06:23,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141600.0, ans=0.1 2023-06-18 10:08:07,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=141780.0, ans=0.125 2023-06-18 10:08:07,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=141780.0, ans=0.0 2023-06-18 10:08:08,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 3.095e+02 3.693e+02 4.407e+02 6.966e+02, threshold=7.385e+02, percent-clipped=0.0 2023-06-18 10:08:09,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=141780.0, ans=0.125 2023-06-18 10:08:50,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=141840.0, ans=0.2 2023-06-18 10:09:12,735 INFO [train.py:996] (0/4) Epoch 1, batch 23650, loss[loss=0.3955, simple_loss=0.4349, pruned_loss=0.1781, over 21575.00 frames. ], tot_loss[loss=0.3185, simple_loss=0.3673, pruned_loss=0.1349, over 4280727.01 frames. 
], batch size: 414, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:10:07,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=141960.0, ans=0.0 2023-06-18 10:11:51,785 INFO [train.py:996] (0/4) Epoch 1, batch 23700, loss[loss=0.2336, simple_loss=0.3052, pruned_loss=0.08097, over 21616.00 frames. ], tot_loss[loss=0.3198, simple_loss=0.3705, pruned_loss=0.1345, over 4281086.01 frames. ], batch size: 230, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:13:02,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=142320.0, ans=0.125 2023-06-18 10:13:43,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=142380.0, ans=0.0 2023-06-18 10:13:43,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=142380.0, ans=0.125 2023-06-18 10:13:44,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.277e+02 3.827e+02 4.804e+02 8.451e+02, threshold=7.655e+02, percent-clipped=1.0 2023-06-18 10:14:20,757 INFO [train.py:996] (0/4) Epoch 1, batch 23750, loss[loss=0.3721, simple_loss=0.4146, pruned_loss=0.1648, over 21768.00 frames. ], tot_loss[loss=0.3211, simple_loss=0.3727, pruned_loss=0.1348, over 4276691.25 frames. ], batch size: 441, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:14:24,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2023-06-18 10:14:47,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0 2023-06-18 10:15:16,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=142560.0, ans=0.0 2023-06-18 10:15:23,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=142560.0, ans=0.125 2023-06-18 10:16:15,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=142680.0, ans=0.125 2023-06-18 10:17:13,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=142740.0, ans=0.125 2023-06-18 10:17:15,877 INFO [train.py:996] (0/4) Epoch 1, batch 23800, loss[loss=0.3912, simple_loss=0.4606, pruned_loss=0.161, over 21218.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3694, pruned_loss=0.131, over 4268705.22 frames. 
], batch size: 548, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 10:18:50,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=142920.0, ans=0.0 2023-06-18 10:19:06,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 3.371e+02 4.073e+02 5.438e+02 8.873e+02, threshold=8.146e+02, percent-clipped=4.0 2023-06-18 10:19:07,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=142980.0, ans=0.125 2023-06-18 10:19:07,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=142980.0, ans=0.2 2023-06-18 10:19:13,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=142980.0, ans=0.0 2023-06-18 10:20:03,706 INFO [train.py:996] (0/4) Epoch 1, batch 23850, loss[loss=0.4349, simple_loss=0.4465, pruned_loss=0.2117, over 21405.00 frames. ], tot_loss[loss=0.3276, simple_loss=0.3826, pruned_loss=0.1363, over 4261815.59 frames. ], batch size: 471, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:21:28,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=143280.0, ans=0.125 2023-06-18 10:22:38,204 INFO [train.py:996] (0/4) Epoch 1, batch 23900, loss[loss=0.425, simple_loss=0.4716, pruned_loss=0.1892, over 21446.00 frames. ], tot_loss[loss=0.3339, simple_loss=0.3903, pruned_loss=0.1387, over 4262313.65 frames. ], batch size: 471, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:22:38,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=143400.0, ans=0.0 2023-06-18 10:24:08,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=143520.0, ans=0.0 2023-06-18 10:24:14,311 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.646e+02 4.391e+02 5.259e+02 8.608e+02, threshold=8.781e+02, percent-clipped=3.0 2023-06-18 10:24:18,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=12.0 2023-06-18 10:24:33,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=143640.0, ans=0.1 2023-06-18 10:24:38,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=143640.0, ans=0.0 2023-06-18 10:24:44,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=143640.0, ans=0.125 2023-06-18 10:24:56,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.19 vs. limit=6.0 2023-06-18 10:25:07,018 INFO [train.py:996] (0/4) Epoch 1, batch 23950, loss[loss=0.3259, simple_loss=0.358, pruned_loss=0.1469, over 20644.00 frames. ], tot_loss[loss=0.3275, simple_loss=0.3806, pruned_loss=0.1372, over 4262070.46 frames. 
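Just below, [checkpoint.py:75] records a save to zipformer/exp_L_small/checkpoint-24000.pt at the moment batch 24000 begins, i.e. checkpoints are named and triggered by the global batch index. A minimal sketch of that pattern (the interval and the payload keys are illustrative, not the recipe's exact checkpoint format):

from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx, exp_dir,
                          save_every_n=4000):  # illustrative interval
    if batch_idx == 0 or batch_idx % save_every_n != 0:
        return None
    path = Path(exp_dir) / f"checkpoint-{batch_idx}.pt"
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx": batch_idx,
        },
        path,
    )
    return path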
], batch size: 607, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:26:03,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=143820.0, ans=0.0 2023-06-18 10:26:12,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=143820.0, ans=0.125 2023-06-18 10:26:37,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=143880.0, ans=0.0 2023-06-18 10:27:34,542 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-24000.pt 2023-06-18 10:27:39,009 INFO [train.py:996] (0/4) Epoch 1, batch 24000, loss[loss=0.3799, simple_loss=0.4165, pruned_loss=0.1717, over 21584.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.3813, pruned_loss=0.1404, over 4264029.19 frames. ], batch size: 389, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:27:39,010 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 10:28:35,854 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3093, simple_loss=0.4026, pruned_loss=0.108, over 1796401.00 frames. 2023-06-18 10:28:35,855 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-18 10:29:10,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=144120.0, ans=0.125 2023-06-18 10:29:41,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=144120.0, ans=0.0 2023-06-18 10:29:54,933 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.332e+02 4.254e+02 5.266e+02 8.160e+02, threshold=8.508e+02, percent-clipped=0.0 2023-06-18 10:29:55,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=144180.0, ans=0.1 2023-06-18 10:30:57,458 INFO [train.py:996] (0/4) Epoch 1, batch 24050, loss[loss=0.2892, simple_loss=0.3533, pruned_loss=0.1125, over 21151.00 frames. ], tot_loss[loss=0.3326, simple_loss=0.3833, pruned_loss=0.141, over 4275100.78 frames. ], batch size: 143, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 10:31:00,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=22.5 2023-06-18 10:32:34,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=144480.0, ans=0.0 2023-06-18 10:32:53,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=144540.0, ans=0.125 2023-06-18 10:33:33,010 INFO [train.py:996] (0/4) Epoch 1, batch 24100, loss[loss=0.3785, simple_loss=0.427, pruned_loss=0.165, over 21557.00 frames. ], tot_loss[loss=0.3273, simple_loss=0.3811, pruned_loss=0.1368, over 4276462.90 frames. ], batch size: 414, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:33:49,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.74 vs. 
limit=22.5 2023-06-18 10:34:13,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=144660.0, ans=0.125 2023-06-18 10:34:23,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.79 vs. limit=10.0 2023-06-18 10:35:11,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.942e+02 3.455e+02 4.270e+02 6.520e+02, threshold=6.911e+02, percent-clipped=0.0 2023-06-18 10:35:43,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=144840.0, ans=0.1 2023-06-18 10:35:59,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=144840.0, ans=0.125 2023-06-18 10:36:00,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=144840.0, ans=0.125 2023-06-18 10:36:07,970 INFO [train.py:996] (0/4) Epoch 1, batch 24150, loss[loss=0.3722, simple_loss=0.3933, pruned_loss=0.1756, over 21802.00 frames. ], tot_loss[loss=0.329, simple_loss=0.3799, pruned_loss=0.139, over 4281423.66 frames. ], batch size: 441, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:36:09,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=144900.0, ans=0.125 2023-06-18 10:36:46,094 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=22.5 2023-06-18 10:37:43,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=145020.0, ans=0.125 2023-06-18 10:37:46,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145080.0, ans=0.1 2023-06-18 10:37:56,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145080.0, ans=0.1 2023-06-18 10:38:47,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-18 10:38:51,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=145200.0, ans=0.0 2023-06-18 10:38:51,875 INFO [train.py:996] (0/4) Epoch 1, batch 24200, loss[loss=0.2917, simple_loss=0.3561, pruned_loss=0.1136, over 21629.00 frames. ], tot_loss[loss=0.3306, simple_loss=0.3807, pruned_loss=0.1402, over 4282049.56 frames. ], batch size: 230, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:39:46,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.17 vs. 
limit=15.0 2023-06-18 10:39:48,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=145260.0, ans=0.125 2023-06-18 10:39:48,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=145260.0, ans=0.125 2023-06-18 10:40:14,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145320.0, ans=0.1 2023-06-18 10:40:26,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.21 vs. limit=15.0 2023-06-18 10:40:52,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 3.211e+02 3.713e+02 4.358e+02 6.625e+02, threshold=7.425e+02, percent-clipped=0.0 2023-06-18 10:41:47,347 INFO [train.py:996] (0/4) Epoch 1, batch 24250, loss[loss=0.2503, simple_loss=0.329, pruned_loss=0.08579, over 21291.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3757, pruned_loss=0.1302, over 4283037.90 frames. ], batch size: 176, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 10:42:35,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=145560.0, ans=0.125 2023-06-18 10:44:28,669 INFO [train.py:996] (0/4) Epoch 1, batch 24300, loss[loss=0.2417, simple_loss=0.3119, pruned_loss=0.08579, over 21770.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3656, pruned_loss=0.122, over 4272798.35 frames. ], batch size: 332, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:44:47,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-06-18 10:44:48,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=145800.0, ans=0.0 2023-06-18 10:44:58,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=145860.0, ans=0.2 2023-06-18 10:45:25,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=145860.0, ans=0.2 2023-06-18 10:45:41,667 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:46:15,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 2.565e+02 3.464e+02 4.560e+02 9.381e+02, threshold=6.928e+02, percent-clipped=3.0 2023-06-18 10:46:38,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.89 vs. limit=15.0 2023-06-18 10:46:52,482 INFO [train.py:996] (0/4) Epoch 1, batch 24350, loss[loss=0.281, simple_loss=0.3316, pruned_loss=0.1152, over 21476.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3614, pruned_loss=0.1231, over 4274345.30 frames. ], batch size: 177, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:48:39,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=146280.0, ans=0.05 2023-06-18 10:48:44,094 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:49:48,882 INFO [train.py:996] (0/4) Epoch 1, batch 24400, loss[loss=0.3018, simple_loss=0.354, pruned_loss=0.1248, over 21319.00 frames. 
], tot_loss[loss=0.3134, simple_loss=0.3678, pruned_loss=0.1295, over 4275253.94 frames. ], batch size: 548, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:49:49,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146400.0, ans=0.1 2023-06-18 10:49:54,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=146400.0, ans=0.2 2023-06-18 10:50:52,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146520.0, ans=0.1 2023-06-18 10:51:09,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.327e+02 4.021e+02 4.896e+02 7.862e+02, threshold=8.041e+02, percent-clipped=3.0 2023-06-18 10:51:11,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=146580.0, ans=0.125 2023-06-18 10:51:34,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=146580.0, ans=0.125 2023-06-18 10:52:15,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=146640.0, ans=0.2 2023-06-18 10:52:18,230 INFO [train.py:996] (0/4) Epoch 1, batch 24450, loss[loss=0.3362, simple_loss=0.4087, pruned_loss=0.1318, over 21727.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3701, pruned_loss=0.1302, over 4268321.85 frames. ], batch size: 414, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 10:52:18,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=146700.0, ans=0.2 2023-06-18 10:52:25,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146700.0, ans=0.1 2023-06-18 10:54:48,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=147000.0, ans=0.0 2023-06-18 10:54:49,369 INFO [train.py:996] (0/4) Epoch 1, batch 24500, loss[loss=0.3494, simple_loss=0.3863, pruned_loss=0.1562, over 21789.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3696, pruned_loss=0.1305, over 4266025.73 frames. ], batch size: 441, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 10:54:52,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=147000.0, ans=0.2 2023-06-18 10:55:03,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=147000.0, ans=0.125 2023-06-18 10:55:03,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=147000.0, ans=0.125 2023-06-18 10:55:21,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.96 vs. 
limit=15.0 2023-06-18 10:55:39,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=147060.0, ans=0.05 2023-06-18 10:56:44,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=147180.0, ans=0.125 2023-06-18 10:56:45,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.166e+02 3.756e+02 4.568e+02 6.399e+02, threshold=7.511e+02, percent-clipped=0.0 2023-06-18 10:56:48,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-18 10:57:32,113 INFO [train.py:996] (0/4) Epoch 1, batch 24550, loss[loss=0.4424, simple_loss=0.4628, pruned_loss=0.211, over 21338.00 frames. ], tot_loss[loss=0.3227, simple_loss=0.3748, pruned_loss=0.1353, over 4273279.90 frames. ], batch size: 507, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 10:57:48,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=147300.0, ans=15.0 2023-06-18 10:57:49,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=147300.0, ans=0.125 2023-06-18 10:58:29,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=147360.0, ans=0.0 2023-06-18 10:58:47,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.42 vs. limit=5.0 2023-06-18 10:58:54,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=15.0 2023-06-18 10:58:57,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147480.0, ans=0.0 2023-06-18 11:00:07,912 INFO [train.py:996] (0/4) Epoch 1, batch 24600, loss[loss=0.3015, simple_loss=0.3505, pruned_loss=0.1263, over 21815.00 frames. ], tot_loss[loss=0.3203, simple_loss=0.3697, pruned_loss=0.1355, over 4271895.14 frames. ], batch size: 352, lr: 2.43e-02, grad_scale: 64.0 2023-06-18 11:00:18,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=147600.0, ans=0.0 2023-06-18 11:00:18,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=147600.0, ans=0.125 2023-06-18 11:01:26,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.629e+02 3.281e+02 4.105e+02 4.989e+02 7.810e+02, threshold=8.210e+02, percent-clipped=2.0 2023-06-18 11:01:31,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=147780.0, ans=0.125 2023-06-18 11:02:12,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147840.0, ans=0.1 2023-06-18 11:02:16,245 INFO [train.py:996] (0/4) Epoch 1, batch 24650, loss[loss=0.2668, simple_loss=0.3228, pruned_loss=0.1054, over 15718.00 frames. ], tot_loss[loss=0.3138, simple_loss=0.3616, pruned_loss=0.133, over 4272437.78 frames. 
], batch size: 63, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 11:03:04,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=147960.0, ans=0.125 2023-06-18 11:03:05,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=147960.0, ans=0.07 2023-06-18 11:03:10,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=147960.0, ans=0.0 2023-06-18 11:03:29,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=148020.0, ans=0.125 2023-06-18 11:03:42,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=148020.0, ans=0.0 2023-06-18 11:04:22,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=148140.0, ans=0.2 2023-06-18 11:04:42,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=148140.0, ans=0.125 2023-06-18 11:04:48,088 INFO [train.py:996] (0/4) Epoch 1, batch 24700, loss[loss=0.2849, simple_loss=0.335, pruned_loss=0.1174, over 21457.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3576, pruned_loss=0.1298, over 4277408.47 frames. ], batch size: 212, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 11:04:56,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=148200.0, ans=0.125 2023-06-18 11:06:01,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. limit=10.0 2023-06-18 11:06:10,604 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.936e+02 3.514e+02 4.339e+02 5.892e+02, threshold=7.028e+02, percent-clipped=0.0 2023-06-18 11:06:33,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148380.0, ans=0.1 2023-06-18 11:07:20,637 INFO [train.py:996] (0/4) Epoch 1, batch 24750, loss[loss=0.2676, simple_loss=0.3143, pruned_loss=0.1104, over 21989.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.35, pruned_loss=0.1262, over 4273225.79 frames. ], batch size: 119, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:07:42,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=148560.0, ans=0.0 2023-06-18 11:08:11,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148620.0, ans=0.1 2023-06-18 11:09:16,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-18 11:09:44,199 INFO [train.py:996] (0/4) Epoch 1, batch 24800, loss[loss=0.333, simple_loss=0.368, pruned_loss=0.1491, over 21819.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3457, pruned_loss=0.1267, over 4277788.50 frames. ], batch size: 414, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:09:45,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.03 vs. 
limit=10.0 2023-06-18 11:10:25,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2023-06-18 11:10:27,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=148920.0, ans=0.125 2023-06-18 11:11:21,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.310e+02 3.842e+02 5.008e+02 1.003e+03, threshold=7.684e+02, percent-clipped=5.0 2023-06-18 11:11:30,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=149040.0, ans=0.035 2023-06-18 11:12:17,381 INFO [train.py:996] (0/4) Epoch 1, batch 24850, loss[loss=0.234, simple_loss=0.281, pruned_loss=0.09353, over 21342.00 frames. ], tot_loss[loss=0.3018, simple_loss=0.3473, pruned_loss=0.1281, over 4288608.91 frames. ], batch size: 131, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:12:28,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=149100.0, ans=0.125 2023-06-18 11:12:57,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=149220.0, ans=0.0 2023-06-18 11:13:37,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=149280.0, ans=0.0 2023-06-18 11:13:37,523 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=15.0 2023-06-18 11:14:24,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=149340.0, ans=0.125 2023-06-18 11:14:38,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=149400.0, ans=0.125 2023-06-18 11:14:38,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=149400.0, ans=0.125 2023-06-18 11:14:39,666 INFO [train.py:996] (0/4) Epoch 1, batch 24900, loss[loss=0.3293, simple_loss=0.3854, pruned_loss=0.1366, over 21418.00 frames. ], tot_loss[loss=0.3044, simple_loss=0.3513, pruned_loss=0.1288, over 4287980.27 frames. ], batch size: 131, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 11:15:10,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=149400.0, ans=0.0 2023-06-18 11:15:18,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-18 11:16:38,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 3.164e+02 3.727e+02 4.575e+02 6.932e+02, threshold=7.454e+02, percent-clipped=0.0 2023-06-18 11:17:04,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=149640.0, ans=0.125 2023-06-18 11:17:28,498 INFO [train.py:996] (0/4) Epoch 1, batch 24950, loss[loss=0.4758, simple_loss=0.4774, pruned_loss=0.2371, over 21435.00 frames. ], tot_loss[loss=0.3177, simple_loss=0.3627, pruned_loss=0.1364, over 4288683.19 frames. 
], batch size: 510, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 11:17:54,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=149700.0, ans=0.0 2023-06-18 11:17:56,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=149700.0, ans=0.2 2023-06-18 11:18:25,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=149760.0, ans=0.125 2023-06-18 11:19:29,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.61 vs. limit=22.5 2023-06-18 11:20:09,049 INFO [train.py:996] (0/4) Epoch 1, batch 25000, loss[loss=0.305, simple_loss=0.3576, pruned_loss=0.1262, over 21642.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.3701, pruned_loss=0.138, over 4287911.18 frames. ], batch size: 298, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 11:20:21,724 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=22.5 2023-06-18 11:20:49,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=150060.0, ans=0.125 2023-06-18 11:21:30,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=150120.0, ans=0.0 2023-06-18 11:21:30,536 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.78 vs. limit=15.0 2023-06-18 11:21:31,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=150120.0, ans=0.125 2023-06-18 11:21:50,977 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 3.123e+02 3.491e+02 4.208e+02 6.099e+02, threshold=6.982e+02, percent-clipped=0.0 2023-06-18 11:22:25,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=150240.0, ans=0.125 2023-06-18 11:22:50,200 INFO [train.py:996] (0/4) Epoch 1, batch 25050, loss[loss=0.2771, simple_loss=0.3176, pruned_loss=0.1183, over 21733.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3627, pruned_loss=0.1355, over 4273928.78 frames. ], batch size: 124, lr: 2.41e-02, grad_scale: 16.0 2023-06-18 11:23:36,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.00 vs. 
limit=22.5 2023-06-18 11:24:18,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=150420.0, ans=8.0 2023-06-18 11:24:41,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=150480.0, ans=0.125 2023-06-18 11:24:51,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150540.0, ans=0.1 2023-06-18 11:24:58,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=150540.0, ans=0.2 2023-06-18 11:25:13,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150600.0, ans=0.1 2023-06-18 11:25:21,819 INFO [train.py:996] (0/4) Epoch 1, batch 25100, loss[loss=0.2984, simple_loss=0.3813, pruned_loss=0.1078, over 21689.00 frames. ], tot_loss[loss=0.311, simple_loss=0.3564, pruned_loss=0.1329, over 4276878.13 frames. ], batch size: 332, lr: 2.41e-02, grad_scale: 16.0 2023-06-18 11:27:00,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 3.018e+02 3.792e+02 4.723e+02 7.366e+02, threshold=7.583e+02, percent-clipped=2.0 2023-06-18 11:27:54,108 INFO [train.py:996] (0/4) Epoch 1, batch 25150, loss[loss=0.3707, simple_loss=0.4305, pruned_loss=0.1554, over 21459.00 frames. ], tot_loss[loss=0.3081, simple_loss=0.3576, pruned_loss=0.1293, over 4267502.59 frames. ], batch size: 471, lr: 2.41e-02, grad_scale: 16.0 2023-06-18 11:27:58,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=150900.0, ans=0.125 2023-06-18 11:28:04,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=150900.0, ans=0.09899494936611666 2023-06-18 11:28:05,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.53 vs. limit=15.0 2023-06-18 11:28:06,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=150900.0, ans=0.1 2023-06-18 11:28:06,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=150900.0, ans=0.0 2023-06-18 11:28:22,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150960.0, ans=0.1 2023-06-18 11:29:07,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=151080.0, ans=0.0 2023-06-18 11:29:52,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=151140.0, ans=0.125 2023-06-18 11:29:55,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=151140.0, ans=0.2 2023-06-18 11:30:17,536 INFO [train.py:996] (0/4) Epoch 1, batch 25200, loss[loss=0.2772, simple_loss=0.3475, pruned_loss=0.1034, over 21556.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3553, pruned_loss=0.1258, over 4269283.35 frames. ], batch size: 230, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:30:22,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=21.54 vs. 
limit=15.0 2023-06-18 11:30:32,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=151260.0, ans=0.125 2023-06-18 11:30:36,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151260.0, ans=0.1 2023-06-18 11:31:09,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=151260.0, ans=0.2 2023-06-18 11:31:20,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=151320.0, ans=0.125 2023-06-18 11:31:54,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151380.0, ans=0.1 2023-06-18 11:31:55,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.820e+02 3.433e+02 4.407e+02 9.386e+02, threshold=6.866e+02, percent-clipped=4.0 2023-06-18 11:32:40,301 INFO [train.py:996] (0/4) Epoch 1, batch 25250, loss[loss=0.2657, simple_loss=0.3234, pruned_loss=0.104, over 21680.00 frames. ], tot_loss[loss=0.298, simple_loss=0.3516, pruned_loss=0.1222, over 4267624.19 frames. ], batch size: 298, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:32:48,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=151500.0, ans=0.0 2023-06-18 11:33:11,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151560.0, ans=0.1 2023-06-18 11:33:11,347 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.937e-03 2023-06-18 11:33:57,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=151620.0, ans=0.125 2023-06-18 11:34:05,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151680.0, ans=0.1 2023-06-18 11:34:23,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=151740.0, ans=0.0 2023-06-18 11:34:34,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.77 vs. limit=5.0 2023-06-18 11:34:43,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=151740.0, ans=0.0 2023-06-18 11:35:17,340 INFO [train.py:996] (0/4) Epoch 1, batch 25300, loss[loss=0.2889, simple_loss=0.348, pruned_loss=0.1148, over 21633.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.348, pruned_loss=0.1217, over 4262097.86 frames. 
], batch size: 263, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:35:47,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=151860.0, ans=0.0 2023-06-18 11:36:12,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=151920.0, ans=0.0 2023-06-18 11:36:20,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=151920.0, ans=0.125 2023-06-18 11:36:45,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=151980.0, ans=0.0 2023-06-18 11:37:01,681 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 3.038e+02 3.574e+02 4.341e+02 5.884e+02, threshold=7.148e+02, percent-clipped=0.0 2023-06-18 11:37:36,657 INFO [train.py:996] (0/4) Epoch 1, batch 25350, loss[loss=0.2898, simple_loss=0.3497, pruned_loss=0.115, over 21523.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3509, pruned_loss=0.122, over 4258236.75 frames. ], batch size: 389, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 11:37:54,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=152100.0, ans=0.0 2023-06-18 11:38:17,114 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:39:02,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=152220.0, ans=0.0 2023-06-18 11:39:02,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=152220.0, ans=0.125 2023-06-18 11:39:04,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.21 vs. limit=15.0 2023-06-18 11:39:49,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=152340.0, ans=0.0 2023-06-18 11:40:10,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-18 11:40:10,538 INFO [train.py:996] (0/4) Epoch 1, batch 25400, loss[loss=0.2705, simple_loss=0.3436, pruned_loss=0.0987, over 21594.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.349, pruned_loss=0.1218, over 4257127.01 frames. ], batch size: 441, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:41:58,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 3.135e+02 3.772e+02 5.152e+02 8.447e+02, threshold=7.545e+02, percent-clipped=8.0 2023-06-18 11:42:12,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-18 11:42:48,243 INFO [train.py:996] (0/4) Epoch 1, batch 25450, loss[loss=0.2926, simple_loss=0.3395, pruned_loss=0.1228, over 21814.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3497, pruned_loss=0.1235, over 4252369.42 frames. 
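The frequent [scaling.py:182] ScheduledFloat records track regularizer hyperparameters (dropout probabilities, skip rates, balancer targets) that are scheduled as piecewise-linear functions of batch_count; ans is the value currently in effect. A sketch of that interpolation, with hypothetical breakpoints:

    def scheduled_float(batch_count, points):
        # points: [(batch_count, value), ...] sorted by batch_count.
        # Linear interpolation between breakpoints, clamped outside them.
        x0, y0 = points[0]
        if batch_count <= x0:
            return y0
        for x1, y1 in points[1:]:
            if batch_count <= x1:
                return y0 + (batch_count - x0) * (y1 - y0) / (x1 - x0)
            x0, y0 = x1, y1
        return y0

    # Hypothetical schedule: hold 0.3 until batch 20000, decay linearly
    # to 0.1 at batch 50000, then hold 0.1.
    p = scheduled_float(151260.0, [(20000.0, 0.3), (50000.0, 0.1)])  # -> 0.1
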
], batch size: 282, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:42:48,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152700.0, ans=0.1 2023-06-18 11:44:03,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=152820.0, ans=0.2 2023-06-18 11:44:04,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.77 vs. limit=6.0 2023-06-18 11:44:58,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-18 11:45:17,145 INFO [train.py:996] (0/4) Epoch 1, batch 25500, loss[loss=0.3209, simple_loss=0.3849, pruned_loss=0.1284, over 21648.00 frames. ], tot_loss[loss=0.2976, simple_loss=0.3517, pruned_loss=0.1218, over 4263291.80 frames. ], batch size: 389, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:46:42,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=153120.0, ans=0.0 2023-06-18 11:47:14,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.935e+02 3.590e+02 4.450e+02 7.910e+02, threshold=7.180e+02, percent-clipped=1.0 2023-06-18 11:47:16,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=153180.0, ans=0.0 2023-06-18 11:48:00,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=153240.0, ans=0.125 2023-06-18 11:48:09,539 INFO [train.py:996] (0/4) Epoch 1, batch 25550, loss[loss=0.2917, simple_loss=0.3753, pruned_loss=0.1041, over 21633.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3586, pruned_loss=0.122, over 4271019.82 frames. ], batch size: 389, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:50:19,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=153480.0, ans=0.2 2023-06-18 11:50:41,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153540.0, ans=0.1 2023-06-18 11:51:05,017 INFO [train.py:996] (0/4) Epoch 1, batch 25600, loss[loss=0.3914, simple_loss=0.4214, pruned_loss=0.1807, over 21824.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3639, pruned_loss=0.1232, over 4272834.23 frames. ], batch size: 441, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 11:51:33,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. 
limit=12.0 2023-06-18 11:52:54,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=153780.0, ans=10.0 2023-06-18 11:52:57,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 3.009e+02 3.848e+02 5.961e+02 1.110e+03, threshold=7.697e+02, percent-clipped=15.0 2023-06-18 11:53:02,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=153780.0, ans=0.5 2023-06-18 11:53:14,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153840.0, ans=0.1 2023-06-18 11:53:16,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-18 11:53:35,685 INFO [train.py:996] (0/4) Epoch 1, batch 25650, loss[loss=0.2817, simple_loss=0.3262, pruned_loss=0.1186, over 21717.00 frames. ], tot_loss[loss=0.3096, simple_loss=0.3655, pruned_loss=0.1269, over 4265004.00 frames. ], batch size: 300, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 11:55:10,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=154020.0, ans=0.125 2023-06-18 11:55:18,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-18 11:55:30,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=154080.0, ans=0.125 2023-06-18 11:56:09,188 INFO [train.py:996] (0/4) Epoch 1, batch 25700, loss[loss=0.418, simple_loss=0.4895, pruned_loss=0.1732, over 19771.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3625, pruned_loss=0.1278, over 4269151.49 frames. ], batch size: 702, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 11:56:49,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=154260.0, ans=0.2 2023-06-18 11:57:41,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=154320.0, ans=0.1 2023-06-18 11:58:10,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 3.158e+02 3.682e+02 4.301e+02 9.092e+02, threshold=7.363e+02, percent-clipped=1.0 2023-06-18 11:58:33,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154440.0, ans=0.1 2023-06-18 11:59:00,241 INFO [train.py:996] (0/4) Epoch 1, batch 25750, loss[loss=0.3531, simple_loss=0.4016, pruned_loss=0.1523, over 21452.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3675, pruned_loss=0.132, over 4271032.95 frames. ], batch size: 131, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 11:59:02,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=154500.0, ans=0.0 2023-06-18 11:59:23,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=154500.0, ans=0.125 2023-06-18 12:01:57,093 INFO [train.py:996] (0/4) Epoch 1, batch 25800, loss[loss=0.3794, simple_loss=0.4329, pruned_loss=0.1629, over 21379.00 frames. ], tot_loss[loss=0.3278, simple_loss=0.38, pruned_loss=0.1378, over 4271076.62 frames. 
], batch size: 131, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 12:02:45,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=154860.0, ans=0.0 2023-06-18 12:03:56,363 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.535e+02 4.132e+02 5.213e+02 8.329e+02, threshold=8.265e+02, percent-clipped=2.0 2023-06-18 12:04:30,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=155040.0, ans=0.2 2023-06-18 12:04:42,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=155040.0, ans=0.125 2023-06-18 12:05:02,198 INFO [train.py:996] (0/4) Epoch 1, batch 25850, loss[loss=0.2811, simple_loss=0.3306, pruned_loss=0.1158, over 21834.00 frames. ], tot_loss[loss=0.3289, simple_loss=0.3828, pruned_loss=0.1375, over 4270323.67 frames. ], batch size: 247, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 12:05:16,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=155160.0, ans=0.025 2023-06-18 12:05:25,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=155160.0, ans=0.95 2023-06-18 12:06:05,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155220.0, ans=0.1 2023-06-18 12:06:43,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=155280.0, ans=0.125 2023-06-18 12:07:06,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=155340.0, ans=0.125 2023-06-18 12:07:12,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=155340.0, ans=0.125 2023-06-18 12:07:30,278 INFO [train.py:996] (0/4) Epoch 1, batch 25900, loss[loss=0.3718, simple_loss=0.3999, pruned_loss=0.1719, over 20030.00 frames. ], tot_loss[loss=0.3302, simple_loss=0.3839, pruned_loss=0.1383, over 4274522.91 frames. 
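In the [train.py:996] records, loss[...] describes the current batch while tot_loss[...] is a frame-weighted aggregate over the recent interval, which is why both carry an "over N frames" count. A sketch of that bookkeeping under the assumption of plain frame weighting (the trainer's own tracker may smooth differently):

    class RunningLoss:
        # Frame-weighted running average of a per-frame loss.
        def __init__(self):
            self.loss_sum = 0.0  # sum of loss * frames over the interval
            self.frames = 0.0

        def update(self, batch_loss, batch_frames):
            # batch_loss is assumed to already be averaged per frame.
            self.loss_sum += batch_loss * batch_frames
            self.frames += batch_frames

        @property
        def avg(self):
            return self.loss_sum / max(self.frames, 1.0)
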
], batch size: 702, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 12:07:58,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155400.0, ans=0.1 2023-06-18 12:08:09,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=155460.0, ans=0.125 2023-06-18 12:08:15,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155460.0, ans=0.1 2023-06-18 12:09:12,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=155520.0, ans=0.1 2023-06-18 12:09:35,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=155580.0, ans=0.125 2023-06-18 12:09:39,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.491e+02 3.281e+02 3.976e+02 5.347e+02 9.829e+02, threshold=7.952e+02, percent-clipped=3.0 2023-06-18 12:09:52,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=155640.0, ans=0.0 2023-06-18 12:10:18,125 INFO [train.py:996] (0/4) Epoch 1, batch 25950, loss[loss=0.3459, simple_loss=0.3925, pruned_loss=0.1497, over 21616.00 frames. ], tot_loss[loss=0.3357, simple_loss=0.3894, pruned_loss=0.141, over 4271570.52 frames. ], batch size: 263, lr: 2.37e-02, grad_scale: 16.0 2023-06-18 12:10:19,361 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.46 vs. limit=8.0 2023-06-18 12:11:38,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=155820.0, ans=0.125 2023-06-18 12:11:57,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-18 12:12:30,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=155940.0, ans=0.0 2023-06-18 12:12:33,071 INFO [train.py:996] (0/4) Epoch 1, batch 26000, loss[loss=0.467, simple_loss=0.4851, pruned_loss=0.2245, over 21409.00 frames. ], tot_loss[loss=0.3341, simple_loss=0.3894, pruned_loss=0.1394, over 4273262.27 frames. ], batch size: 509, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 12:12:56,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=156000.0, ans=0.2 2023-06-18 12:13:59,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=156060.0, ans=0.125 2023-06-18 12:14:18,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=156120.0, ans=0.125 2023-06-18 12:14:31,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156180.0, ans=0.1 2023-06-18 12:14:44,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.144e+02 3.654e+02 4.482e+02 8.156e+02, threshold=7.307e+02, percent-clipped=1.0 2023-06-18 12:15:24,698 INFO [train.py:996] (0/4) Epoch 1, batch 26050, loss[loss=0.2843, simple_loss=0.4077, pruned_loss=0.08046, over 19908.00 frames. 
], tot_loss[loss=0.3346, simple_loss=0.3888, pruned_loss=0.1402, over 4278421.15 frames. ], batch size: 702, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 12:17:20,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=156480.0, ans=0.0 2023-06-18 12:18:05,268 INFO [train.py:996] (0/4) Epoch 1, batch 26100, loss[loss=0.2849, simple_loss=0.3302, pruned_loss=0.1198, over 21890.00 frames. ], tot_loss[loss=0.3304, simple_loss=0.3835, pruned_loss=0.1386, over 4280829.62 frames. ], batch size: 298, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:18:46,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5 2023-06-18 12:19:06,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=156660.0, ans=0.125 2023-06-18 12:19:33,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=156780.0, ans=0.0 2023-06-18 12:19:41,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.360e+02 3.963e+02 4.876e+02 8.517e+02, threshold=7.926e+02, percent-clipped=3.0 2023-06-18 12:20:22,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156840.0, ans=0.1 2023-06-18 12:20:46,671 INFO [train.py:996] (0/4) Epoch 1, batch 26150, loss[loss=0.3517, simple_loss=0.3984, pruned_loss=0.1525, over 21314.00 frames. ], tot_loss[loss=0.3286, simple_loss=0.3802, pruned_loss=0.1385, over 4290141.00 frames. ], batch size: 159, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:21:38,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=156960.0, ans=0.0 2023-06-18 12:22:19,482 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:22:26,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.43 vs. limit=15.0 2023-06-18 12:22:53,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=157080.0, ans=0.2 2023-06-18 12:23:23,577 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:23:26,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=157140.0, ans=0.125 2023-06-18 12:23:30,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=157140.0, ans=0.0 2023-06-18 12:23:34,852 INFO [train.py:996] (0/4) Epoch 1, batch 26200, loss[loss=0.2732, simple_loss=0.3623, pruned_loss=0.09207, over 21449.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3791, pruned_loss=0.1357, over 4287498.99 frames. ], batch size: 211, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:24:13,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.31 vs. 
limit=22.5 2023-06-18 12:24:34,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=157260.0, ans=0.2 2023-06-18 12:24:48,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=157320.0, ans=6.0 2023-06-18 12:25:07,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=157380.0, ans=0.125 2023-06-18 12:25:21,218 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.203e+02 3.929e+02 5.143e+02 8.013e+02, threshold=7.858e+02, percent-clipped=1.0 2023-06-18 12:26:04,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=157440.0, ans=0.125 2023-06-18 12:26:24,268 INFO [train.py:996] (0/4) Epoch 1, batch 26250, loss[loss=0.2927, simple_loss=0.3618, pruned_loss=0.1119, over 21163.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3817, pruned_loss=0.133, over 4284109.52 frames. ], batch size: 608, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:27:13,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=157560.0, ans=0.125 2023-06-18 12:27:18,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=157560.0, ans=0.125 2023-06-18 12:27:19,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=157560.0, ans=0.125 2023-06-18 12:27:30,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=157620.0, ans=0.125 2023-06-18 12:28:52,332 INFO [train.py:996] (0/4) Epoch 1, batch 26300, loss[loss=0.3127, simple_loss=0.3592, pruned_loss=0.1331, over 21775.00 frames. ], tot_loss[loss=0.3242, simple_loss=0.3787, pruned_loss=0.1348, over 4290547.70 frames. ], batch size: 112, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 12:29:18,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=157800.0, ans=0.2 2023-06-18 12:29:18,184 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:29:39,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=157860.0, ans=0.2 2023-06-18 12:30:12,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=157920.0, ans=0.125 2023-06-18 12:30:46,901 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.240e+02 3.042e+02 3.851e+02 4.638e+02 8.463e+02, threshold=7.702e+02, percent-clipped=2.0 2023-06-18 12:31:06,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=157980.0, ans=0.0 2023-06-18 12:31:45,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=158040.0, ans=0.95 2023-06-18 12:31:47,944 INFO [train.py:996] (0/4) Epoch 1, batch 26350, loss[loss=0.3536, simple_loss=0.3832, pruned_loss=0.1621, over 20030.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3759, pruned_loss=0.1353, over 4281354.20 frames. 
], batch size: 702, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:32:27,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=158160.0, ans=0.125 2023-06-18 12:32:46,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=158160.0, ans=0.5 2023-06-18 12:33:54,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=158340.0, ans=0.2 2023-06-18 12:34:19,820 INFO [train.py:996] (0/4) Epoch 1, batch 26400, loss[loss=0.3217, simple_loss=0.3605, pruned_loss=0.1415, over 21775.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3706, pruned_loss=0.1352, over 4282169.14 frames. ], batch size: 98, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:34:48,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=158460.0, ans=0.025 2023-06-18 12:36:06,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.087e+02 3.915e+02 4.820e+02 6.818e+02, threshold=7.829e+02, percent-clipped=0.0 2023-06-18 12:36:54,710 INFO [train.py:996] (0/4) Epoch 1, batch 26450, loss[loss=0.3598, simple_loss=0.4354, pruned_loss=0.1421, over 21865.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.371, pruned_loss=0.1349, over 4277394.64 frames. ], batch size: 372, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:37:26,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=158700.0, ans=0.0 2023-06-18 12:38:18,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-18 12:38:27,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=158820.0, ans=0.0 2023-06-18 12:38:30,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=158820.0, ans=0.125 2023-06-18 12:39:19,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=158940.0, ans=0.125 2023-06-18 12:40:01,465 INFO [train.py:996] (0/4) Epoch 1, batch 26500, loss[loss=0.3075, simple_loss=0.3749, pruned_loss=0.12, over 21735.00 frames. ], tot_loss[loss=0.3182, simple_loss=0.3711, pruned_loss=0.1326, over 4276543.07 frames. ], batch size: 351, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:40:44,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=159060.0, ans=0.125 2023-06-18 12:40:44,936 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:41:57,604 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 3.526e+02 4.481e+02 5.627e+02 1.219e+03, threshold=8.961e+02, percent-clipped=9.0 2023-06-18 12:41:58,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=159180.0, ans=0.125 2023-06-18 12:42:23,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=159180.0, ans=0.07 2023-06-18 12:42:42,864 INFO [train.py:996] (0/4) Epoch 1, batch 26550, loss[loss=0.2621, simple_loss=0.3449, pruned_loss=0.0897, over 21742.00 frames. 
], tot_loss[loss=0.3114, simple_loss=0.3667, pruned_loss=0.1281, over 4268468.79 frames. ], batch size: 332, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 12:45:16,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=6.0 2023-06-18 12:45:39,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=159540.0, ans=0.04949747468305833 2023-06-18 12:45:44,645 INFO [train.py:996] (0/4) Epoch 1, batch 26600, loss[loss=0.2883, simple_loss=0.3413, pruned_loss=0.1177, over 21192.00 frames. ], tot_loss[loss=0.3073, simple_loss=0.3654, pruned_loss=0.1246, over 4263724.13 frames. ], batch size: 176, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:45:46,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=159600.0, ans=0.0 2023-06-18 12:46:01,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=159600.0, ans=0.125 2023-06-18 12:47:42,436 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.824e+02 3.393e+02 4.003e+02 5.713e+02, threshold=6.786e+02, percent-clipped=0.0 2023-06-18 12:48:16,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=159840.0, ans=0.125 2023-06-18 12:48:22,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=159840.0, ans=0.2 2023-06-18 12:48:24,328 INFO [train.py:996] (0/4) Epoch 1, batch 26650, loss[loss=0.2781, simple_loss=0.3217, pruned_loss=0.1173, over 21345.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.3594, pruned_loss=0.1244, over 4257386.56 frames. ], batch size: 194, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:48:45,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159900.0, ans=0.1 2023-06-18 12:49:36,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=160020.0, ans=0.125 2023-06-18 12:49:41,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=160020.0, ans=0.0 2023-06-18 12:50:54,362 INFO [train.py:996] (0/4) Epoch 1, batch 26700, loss[loss=0.3581, simple_loss=0.3884, pruned_loss=0.1639, over 21803.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3512, pruned_loss=0.1203, over 4262835.54 frames. 
], batch size: 441, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:51:07,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=160200.0, ans=0.125 2023-06-18 12:51:18,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=160260.0, ans=10.0 2023-06-18 12:52:28,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=160380.0, ans=0.125 2023-06-18 12:52:36,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.796e+02 3.328e+02 4.126e+02 7.819e+02, threshold=6.657e+02, percent-clipped=1.0 2023-06-18 12:52:46,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=160440.0, ans=0.0 2023-06-18 12:53:35,174 INFO [train.py:996] (0/4) Epoch 1, batch 26750, loss[loss=0.3142, simple_loss=0.3687, pruned_loss=0.1298, over 21720.00 frames. ], tot_loss[loss=0.2935, simple_loss=0.3504, pruned_loss=0.1183, over 4269248.50 frames. ], batch size: 332, lr: 2.34e-02, grad_scale: 16.0 2023-06-18 12:53:42,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=160500.0, ans=0.125 2023-06-18 12:54:08,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-18 12:55:12,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=160680.0, ans=0.125 2023-06-18 12:55:25,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0 2023-06-18 12:55:45,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=160740.0, ans=0.125 2023-06-18 12:56:16,278 INFO [train.py:996] (0/4) Epoch 1, batch 26800, loss[loss=0.2905, simple_loss=0.3428, pruned_loss=0.1191, over 21923.00 frames. ], tot_loss[loss=0.3047, simple_loss=0.3604, pruned_loss=0.1246, over 4262985.86 frames. ], batch size: 98, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 12:56:46,332 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:57:30,993 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.02 vs. limit=22.5 2023-06-18 12:57:57,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=160980.0, ans=0.0 2023-06-18 12:58:01,659 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.459e+02 3.380e+02 3.968e+02 4.895e+02 8.004e+02, threshold=7.935e+02, percent-clipped=7.0 2023-06-18 12:58:46,430 INFO [train.py:996] (0/4) Epoch 1, batch 26850, loss[loss=0.2766, simple_loss=0.3183, pruned_loss=0.1175, over 21700.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.364, pruned_loss=0.1294, over 4263438.48 frames. 
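The [scaling.py:962] Whitening records report a covariance "spread" metric against its scheduled limit (e.g. metric=12.41 vs. limit=15.0 above): the whitener modules push each layer's output covariance toward a multiple of the identity. A sketch of one plausible such metric, the ratio mean(eig^2)/mean(eig)^2 of the per-group feature covariance, which is 1.0 for perfectly white features and grows as variance concentrates in fewer directions; the recipe's exact formula may differ:

    import torch

    def whitening_metric(x, num_groups=1):
        # x: (num_frames, num_channels); channels split into num_groups.
        vals = []
        for g in x.chunk(num_groups, dim=1):
            g = g - g.mean(dim=0, keepdim=True)
            cov = g.t() @ g / g.shape[0]          # per-group covariance
            eigs = torch.linalg.eigvalsh(cov)     # its eigenvalues
            vals.append((eigs ** 2).mean() / eigs.mean() ** 2)
        return torch.stack(vals).mean()
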
], batch size: 333, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 12:58:52,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=161100.0, ans=0.0 2023-06-18 12:59:18,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=15.0 2023-06-18 12:59:25,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161160.0, ans=0.1 2023-06-18 13:00:12,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.77 vs. limit=22.5 2023-06-18 13:00:53,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-18 13:01:17,654 INFO [train.py:996] (0/4) Epoch 1, batch 26900, loss[loss=0.2863, simple_loss=0.3258, pruned_loss=0.1234, over 21656.00 frames. ], tot_loss[loss=0.3045, simple_loss=0.3545, pruned_loss=0.1273, over 4256977.32 frames. ], batch size: 333, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 13:01:55,219 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-18 13:01:59,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=161460.0, ans=0.125 2023-06-18 13:02:55,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=161580.0, ans=0.0 2023-06-18 13:02:55,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=161580.0, ans=0.125 2023-06-18 13:02:58,286 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 2.755e+02 3.329e+02 3.765e+02 8.557e+02, threshold=6.658e+02, percent-clipped=2.0 2023-06-18 13:03:11,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=161640.0, ans=0.0 2023-06-18 13:03:27,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=161640.0, ans=0.0 2023-06-18 13:03:35,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-18 13:03:37,562 INFO [train.py:996] (0/4) Epoch 1, batch 26950, loss[loss=0.3287, simple_loss=0.3949, pruned_loss=0.1313, over 21647.00 frames. ], tot_loss[loss=0.3043, simple_loss=0.3539, pruned_loss=0.1273, over 4254753.88 frames. ], batch size: 263, lr: 2.33e-02, grad_scale: 16.0 2023-06-18 13:04:23,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161760.0, ans=0.1 2023-06-18 13:05:30,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. 
limit=15.0 2023-06-18 13:05:48,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=161940.0, ans=0.02 2023-06-18 13:05:58,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=161940.0, ans=0.2 2023-06-18 13:05:58,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=161940.0, ans=0.125 2023-06-18 13:06:17,428 INFO [train.py:996] (0/4) Epoch 1, batch 27000, loss[loss=0.2618, simple_loss=0.3435, pruned_loss=0.09004, over 21654.00 frames. ], tot_loss[loss=0.3006, simple_loss=0.3533, pruned_loss=0.1239, over 4250479.76 frames. ], batch size: 298, lr: 2.33e-02, grad_scale: 16.0 2023-06-18 13:06:17,429 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 13:06:56,174 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.2848, simple_loss=0.3741, pruned_loss=0.09774, over 1796401.00 frames. 2023-06-18 13:06:56,175 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-18 13:07:04,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=162000.0, ans=0.2 2023-06-18 13:07:11,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=162060.0, ans=0.125 2023-06-18 13:08:26,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.842e+02 3.323e+02 4.144e+02 6.549e+02, threshold=6.646e+02, percent-clipped=0.0 2023-06-18 13:09:19,653 INFO [train.py:996] (0/4) Epoch 1, batch 27050, loss[loss=0.2861, simple_loss=0.3462, pruned_loss=0.113, over 21813.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3531, pruned_loss=0.1186, over 4253422.16 frames. ], batch size: 247, lr: 2.33e-02, grad_scale: 16.0 2023-06-18 13:11:06,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=162480.0, ans=0.125 2023-06-18 13:11:15,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=162480.0, ans=0.125 2023-06-18 13:11:30,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=162540.0, ans=0.07 2023-06-18 13:11:34,183 INFO [train.py:996] (0/4) Epoch 1, batch 27100, loss[loss=0.2847, simple_loss=0.3679, pruned_loss=0.1007, over 21826.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3569, pruned_loss=0.1227, over 4263015.75 frames. ], batch size: 282, lr: 2.32e-02, grad_scale: 16.0 2023-06-18 13:13:40,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=162780.0, ans=0.125 2023-06-18 13:13:45,530 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.912e+02 3.397e+02 4.397e+02 9.217e+02, threshold=6.794e+02, percent-clipped=5.0 2023-06-18 13:14:30,548 INFO [train.py:996] (0/4) Epoch 1, batch 27150, loss[loss=0.3676, simple_loss=0.4309, pruned_loss=0.1521, over 21821.00 frames. ], tot_loss[loss=0.3124, simple_loss=0.37, pruned_loss=0.1274, over 4266550.22 frames. 
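The grad_scale field in the batch records (stepping through values such as 64.0, 32.0 and 16.0 in this stretch) is the dynamic loss scale of mixed-precision training: it is cut back when scaled gradients overflow and regrown after a run of clean steps. A minimal sketch of the standard torch.cuda.amp pattern that maintains such a scale; model, optimizer and batch are placeholders:

    import torch

    scaler = torch.cuda.amp.GradScaler()  # owns the dynamic grad_scale

    def train_step(model, optimizer, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(batch)
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales; skips the step on inf/nan
        scaler.update()                # shrinks the scale after overflow,
                                       # regrows it after clean steps
        return loss.detach(), scaler.get_scale()
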
], batch size: 371, lr: 2.32e-02, grad_scale: 16.0 2023-06-18 13:15:14,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=162960.0, ans=0.125 2023-06-18 13:16:02,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=163020.0, ans=0.125 2023-06-18 13:16:05,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=163020.0, ans=0.0 2023-06-18 13:16:19,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=163080.0, ans=0.05 2023-06-18 13:17:21,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=163140.0, ans=0.0 2023-06-18 13:17:33,478 INFO [train.py:996] (0/4) Epoch 1, batch 27200, loss[loss=0.3265, simple_loss=0.3796, pruned_loss=0.1367, over 21359.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.3767, pruned_loss=0.129, over 4271956.95 frames. ], batch size: 176, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 13:17:55,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.05 vs. limit=6.0 2023-06-18 13:18:56,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=163320.0, ans=0.125 2023-06-18 13:19:09,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=163380.0, ans=0.125 2023-06-18 13:19:30,609 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 3.450e+02 3.753e+02 4.440e+02 7.540e+02, threshold=7.506e+02, percent-clipped=2.0 2023-06-18 13:19:48,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=163440.0, ans=0.125 2023-06-18 13:20:13,111 INFO [train.py:996] (0/4) Epoch 1, batch 27250, loss[loss=0.3306, simple_loss=0.3723, pruned_loss=0.1444, over 20605.00 frames. ], tot_loss[loss=0.3268, simple_loss=0.3822, pruned_loss=0.1357, over 4271932.47 frames. ], batch size: 607, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 13:20:48,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=163560.0, ans=0.0 2023-06-18 13:21:39,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=163620.0, ans=0.07 2023-06-18 13:21:39,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=163620.0, ans=0.125 2023-06-18 13:22:04,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.11 vs. limit=10.0 2023-06-18 13:23:10,378 INFO [train.py:996] (0/4) Epoch 1, batch 27300, loss[loss=0.3839, simple_loss=0.4338, pruned_loss=0.167, over 21731.00 frames. ], tot_loss[loss=0.3305, simple_loss=0.3853, pruned_loss=0.1379, over 4275847.30 frames. 
], batch size: 441, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 13:23:18,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=163800.0, ans=0.025 2023-06-18 13:24:30,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=163920.0, ans=0.1 2023-06-18 13:25:12,793 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.700e+02 4.629e+02 5.941e+02 1.159e+03, threshold=9.258e+02, percent-clipped=7.0 2023-06-18 13:25:52,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=164040.0, ans=0.2 2023-06-18 13:26:07,925 INFO [train.py:996] (0/4) Epoch 1, batch 27350, loss[loss=0.3581, simple_loss=0.4662, pruned_loss=0.125, over 19827.00 frames. ], tot_loss[loss=0.334, simple_loss=0.3888, pruned_loss=0.1396, over 4266512.88 frames. ], batch size: 703, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:26:40,782 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.08 vs. limit=15.0 2023-06-18 13:27:07,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-18 13:27:52,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164280.0, ans=0.1 2023-06-18 13:27:52,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=164280.0, ans=0.2 2023-06-18 13:28:06,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=164340.0, ans=0.0 2023-06-18 13:28:34,423 INFO [train.py:996] (0/4) Epoch 1, batch 27400, loss[loss=0.2975, simple_loss=0.3401, pruned_loss=0.1275, over 21260.00 frames. ], tot_loss[loss=0.3287, simple_loss=0.3824, pruned_loss=0.1375, over 4271764.19 frames. ], batch size: 176, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:29:18,873 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-18 13:29:48,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=164520.0, ans=0.125 2023-06-18 13:29:48,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=164520.0, ans=0.0 2023-06-18 13:30:10,585 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:30:24,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=164580.0, ans=0.0 2023-06-18 13:30:34,864 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 2.849e+02 3.387e+02 3.962e+02 7.059e+02, threshold=6.774e+02, percent-clipped=0.0 2023-06-18 13:30:53,894 INFO [train.py:996] (0/4) Epoch 1, batch 27450, loss[loss=0.3177, simple_loss=0.382, pruned_loss=0.1266, over 21301.00 frames. ], tot_loss[loss=0.3225, simple_loss=0.3754, pruned_loss=0.1348, over 4267021.55 frames. 
], batch size: 548, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:31:32,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=164760.0, ans=0.125 2023-06-18 13:31:50,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=164760.0, ans=0.2 2023-06-18 13:32:23,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-18 13:32:47,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=164880.0, ans=0.0 2023-06-18 13:33:34,192 INFO [train.py:996] (0/4) Epoch 1, batch 27500, loss[loss=0.301, simple_loss=0.3464, pruned_loss=0.1278, over 21865.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3742, pruned_loss=0.1352, over 4270230.14 frames. ], batch size: 282, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:35:20,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=165180.0, ans=0.0 2023-06-18 13:35:23,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=15.0 2023-06-18 13:35:25,333 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.982e+02 3.278e+02 3.839e+02 6.377e+02, threshold=6.555e+02, percent-clipped=0.0 2023-06-18 13:36:05,222 INFO [train.py:996] (0/4) Epoch 1, batch 27550, loss[loss=0.3013, simple_loss=0.3498, pruned_loss=0.1264, over 21819.00 frames. ], tot_loss[loss=0.317, simple_loss=0.3702, pruned_loss=0.132, over 4274661.76 frames. ], batch size: 98, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 13:37:47,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=165480.0, ans=0.0 2023-06-18 13:37:54,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=165540.0, ans=0.125 2023-06-18 13:38:05,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=165540.0, ans=0.125 2023-06-18 13:38:15,287 INFO [train.py:996] (0/4) Epoch 1, batch 27600, loss[loss=0.2784, simple_loss=0.325, pruned_loss=0.1159, over 21748.00 frames. ], tot_loss[loss=0.3103, simple_loss=0.3619, pruned_loss=0.1294, over 4280318.97 frames. ], batch size: 351, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:38:25,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=165600.0, ans=0.05 2023-06-18 13:38:42,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=165660.0, ans=0.0 2023-06-18 13:38:43,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=165660.0, ans=0.09899494936611666 2023-06-18 13:39:04,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=165660.0, ans=0.125 2023-06-18 13:39:06,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. 
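limit=15.0

The `Whitening` entries (scaling.py:962) are logged when a module's whitening statistic is probed against a scheduled limit, as in `metric=10.44 vs. limit=15.0` just above. The log does not show how `metric` is computed, so the sketch below assumes one plausible definition: the mean squared eigenvalue of the per-group feature covariance divided by the squared mean eigenvalue, which equals 1.0 exactly when features are white (covariance proportional to identity) and grows as they become anisotropic.

```python
# Hedged sketch of a whitening diagnostic like the scaling.py:962 lines above.
# This particular eigenvalue-based metric is an assumption, not necessarily
# the formula icefall uses.
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels). Returns 1.0 for perfectly white features."""
    frames, channels = x.shape
    assert channels % num_groups == 0
    x = x.reshape(frames, num_groups, channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)            # zero-mean per channel
    # per-group covariance: shape (num_groups, C/G, C/G)
    cov = torch.einsum("fgi,fgj->gij", x, x) / frames
    eigs = torch.linalg.eigvalsh(cov)              # eigenvalues per group
    metric = (eigs**2).mean() / eigs.mean() ** 2   # >= 1; == 1 iff white
    return metric.item()


x = torch.randn(1000, 256)
print(whitening_metric(x))  # near 1 (sampling noise pushes it slightly above)
print(whitening_metric(x * torch.linspace(0.1, 3.0, 256)))  # anisotropic -> larger
# In the log, "metric=10.44 vs. limit=15.0" means the statistic for that
# module is still under the allowed limit.
```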
2023-06-18 13:40:02,821 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 3.185e+02 3.559e+02 4.117e+02 6.813e+02, threshold=7.118e+02, percent-clipped=1.0 2023-06-18 13:40:24,894 INFO [train.py:996] (0/4) Epoch 1, batch 27650, loss[loss=0.2852, simple_loss=0.3381, pruned_loss=0.1161, over 21365.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3551, pruned_loss=0.1279, over 4274275.55 frames. ], batch size: 144, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:40:56,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0 2023-06-18 13:42:05,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=166080.0, ans=0.2 2023-06-18 13:42:09,117 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=12.0 2023-06-18 13:42:49,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166140.0, ans=0.1 2023-06-18 13:43:02,028 INFO [train.py:996] (0/4) Epoch 1, batch 27700, loss[loss=0.2539, simple_loss=0.34, pruned_loss=0.08393, over 20952.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3534, pruned_loss=0.1242, over 4270252.15 frames. ], batch size: 608, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:43:35,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-18 13:43:52,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=166260.0, ans=0.125 2023-06-18 13:44:20,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=166320.0, ans=0.2 2023-06-18 13:44:43,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-18 13:45:09,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 3.145e+02 3.672e+02 4.225e+02 7.825e+02, threshold=7.344e+02, percent-clipped=2.0 2023-06-18 13:45:09,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166380.0, ans=0.1 2023-06-18 13:45:31,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=166440.0, ans=0.125 2023-06-18 13:45:32,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=166440.0, ans=0.0 2023-06-18 13:45:42,143 INFO [train.py:996] (0/4) Epoch 1, batch 27750, loss[loss=0.2388, simple_loss=0.3095, pruned_loss=0.0841, over 21403.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3556, pruned_loss=0.1221, over 4272569.97 frames.
], batch size: 211, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:46:36,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=166560.0, ans=0.125 2023-06-18 13:46:42,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=166560.0, ans=0.125 2023-06-18 13:48:22,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=166800.0, ans=0.2 2023-06-18 13:48:23,260 INFO [train.py:996] (0/4) Epoch 1, batch 27800, loss[loss=0.3045, simple_loss=0.344, pruned_loss=0.1325, over 21539.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3526, pruned_loss=0.1218, over 4279990.63 frames. ], batch size: 212, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 13:48:26,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=166800.0, ans=0.125 2023-06-18 13:48:43,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=166800.0, ans=0.0 2023-06-18 13:49:56,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=166980.0, ans=0.125 2023-06-18 13:49:59,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=166980.0, ans=0.0 2023-06-18 13:50:08,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=166980.0, ans=0.0 2023-06-18 13:50:11,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 3.084e+02 3.621e+02 4.481e+02 7.199e+02, threshold=7.242e+02, percent-clipped=0.0 2023-06-18 13:50:45,599 INFO [train.py:996] (0/4) Epoch 1, batch 27850, loss[loss=0.3117, simple_loss=0.3481, pruned_loss=0.1376, over 21568.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.3521, pruned_loss=0.1236, over 4289058.47 frames. ], batch size: 548, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 13:50:46,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=167100.0, ans=0.125 2023-06-18 13:52:07,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=167220.0, ans=0.1 2023-06-18 13:53:10,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=167340.0, ans=0.015 2023-06-18 13:53:43,406 INFO [train.py:996] (0/4) Epoch 1, batch 27900, loss[loss=0.3259, simple_loss=0.3871, pruned_loss=0.1324, over 19921.00 frames. ], tot_loss[loss=0.3068, simple_loss=0.362, pruned_loss=0.1258, over 4281241.77 frames. 
], batch size: 702, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 13:54:27,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167460.0, ans=0.1 2023-06-18 13:54:27,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=167460.0, ans=0.0 2023-06-18 13:55:36,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=167580.0, ans=0.125 2023-06-18 13:55:58,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 3.095e+02 3.904e+02 5.239e+02 9.245e+02, threshold=7.808e+02, percent-clipped=6.0 2023-06-18 13:56:23,312 INFO [train.py:996] (0/4) Epoch 1, batch 27950, loss[loss=0.2882, simple_loss=0.3666, pruned_loss=0.1049, over 21717.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3633, pruned_loss=0.1218, over 4277557.51 frames. ], batch size: 351, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 13:57:24,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167760.0, ans=0.1 2023-06-18 13:57:41,077 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:57:41,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-18 13:57:55,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=167820.0, ans=0.125 2023-06-18 13:57:59,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=15.0 2023-06-18 13:58:41,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.59 vs. limit=22.5 2023-06-18 13:59:18,043 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-28000.pt 2023-06-18 13:59:22,420 INFO [train.py:996] (0/4) Epoch 1, batch 28000, loss[loss=0.2873, simple_loss=0.3343, pruned_loss=0.1202, over 21822.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3612, pruned_loss=0.1206, over 4277977.70 frames. ], batch size: 247, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 13:59:38,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=168000.0, ans=0.125 2023-06-18 14:01:10,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.731e+02 2.776e+02 3.408e+02 4.603e+02 6.959e+02, threshold=6.817e+02, percent-clipped=0.0 2023-06-18 14:01:21,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=168240.0, ans=0.015 2023-06-18 14:01:42,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=168300.0, ans=0.125 2023-06-18 14:01:43,200 INFO [train.py:996] (0/4) Epoch 1, batch 28050, loss[loss=0.3081, simple_loss=0.3582, pruned_loss=0.129, over 21187.00 frames. ], tot_loss[loss=0.3016, simple_loss=0.3587, pruned_loss=0.1222, over 4281936.11 frames. ], batch size: 607, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 14:02:44,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.38 vs. 
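limit=22.5

In every `optim.py:471` entry the reported `threshold` equals `Clipping_scale` (2.0) times the logged median grad norm, for example 2.0 × 3.621e+02 = 7.242e+02 a few entries back, so the clipping threshold tracks a running median of recent gradient norms rather than a fixed constant. A sketch of that scheme follows; the sliding-window size and exact bookkeeping are assumptions.

```python
# Hedged sketch of median-tracking gradient clipping consistent with the
# optim.py:471 log lines (threshold = 2.0 x median). Window size is a guess.
from collections import deque

import torch


class MedianGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 400):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent total grad norms
        self.clipped = 0
        self.seen = 0

    def clip_(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        total_norm = torch.norm(
            torch.stack([p.grad.norm() for p in params])
        ).item()
        self.norms.append(total_norm)
        q = torch.quantile(
            torch.tensor(list(self.norms)),
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]),
        )
        threshold = self.clipping_scale * q[2].item()  # 2.0 x median, as logged
        self.seen += 1
        if total_norm > threshold:
            self.clipped += 1
            for p in params:  # scale all grads down so the total norm == threshold
                p.grad.mul_(threshold / total_norm)
        pct = 100.0 * self.clipped / self.seen
        print(f"grad-norm quartiles {q.tolist()}, threshold={threshold:.3e}, "
              f"percent-clipped={pct:.1f}")
        return total_norm
```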
2023-06-18 14:02:46,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=168360.0, ans=0.125 2023-06-18 14:03:25,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=168480.0, ans=0.0 2023-06-18 14:03:47,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2023-06-18 14:03:49,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-06-18 14:03:51,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=168480.0, ans=0.04949747468305833 2023-06-18 14:04:24,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=168540.0, ans=0.0 2023-06-18 14:04:31,084 INFO [train.py:996] (0/4) Epoch 1, batch 28100, loss[loss=0.265, simple_loss=0.3102, pruned_loss=0.1099, over 21483.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3575, pruned_loss=0.1221, over 4275957.95 frames. ], batch size: 195, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 14:05:11,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168660.0, ans=0.1 2023-06-18 14:06:12,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 3.173e+02 3.819e+02 4.992e+02 1.067e+03, threshold=7.638e+02, percent-clipped=9.0 2023-06-18 14:06:44,355 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:06:48,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=168840.0, ans=0.125 2023-06-18 14:06:50,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=168840.0, ans=0.0 2023-06-18 14:06:59,258 INFO [train.py:996] (0/4) Epoch 1, batch 28150, loss[loss=0.2605, simple_loss=0.3041, pruned_loss=0.1084, over 21537.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3509, pruned_loss=0.1224, over 4279930.59 frames. ], batch size: 263, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:07:21,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=168960.0, ans=0.2 2023-06-18 14:07:43,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=168960.0, ans=0.05 2023-06-18 14:09:26,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=169140.0, ans=0.025 2023-06-18 14:09:29,050 INFO [train.py:996] (0/4) Epoch 1, batch 28200, loss[loss=0.3326, simple_loss=0.3671, pruned_loss=0.149, over 20697.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3492, pruned_loss=0.1244, over 4269843.65 frames.
], batch size: 607, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:09:38,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=169200.0, ans=0.1 2023-06-18 14:09:39,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=169200.0, ans=0.0 2023-06-18 14:09:41,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=169200.0, ans=0.125 2023-06-18 14:09:54,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=169200.0, ans=0.2 2023-06-18 14:09:56,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=169200.0, ans=0.0 2023-06-18 14:11:38,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.366e+02 3.817e+02 4.750e+02 7.946e+02, threshold=7.633e+02, percent-clipped=3.0 2023-06-18 14:12:07,344 INFO [train.py:996] (0/4) Epoch 1, batch 28250, loss[loss=0.3104, simple_loss=0.3634, pruned_loss=0.1287, over 21606.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3538, pruned_loss=0.1285, over 4263114.33 frames. ], batch size: 263, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:12:07,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=169500.0, ans=0.09899494936611666 2023-06-18 14:12:12,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=20.53 vs. limit=15.0 2023-06-18 14:12:23,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=169560.0, ans=0.035 2023-06-18 14:12:25,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=169560.0, ans=0.0 2023-06-18 14:14:03,898 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:14:23,363 INFO [train.py:996] (0/4) Epoch 1, batch 28300, loss[loss=0.2454, simple_loss=0.3457, pruned_loss=0.07249, over 20771.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3512, pruned_loss=0.1251, over 4264486.08 frames. ], batch size: 608, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:15:05,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=169860.0, ans=0.125 2023-06-18 14:16:01,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=169920.0, ans=0.2 2023-06-18 14:16:09,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=169920.0, ans=10.0 2023-06-18 14:16:42,455 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 2.780e+02 3.405e+02 4.225e+02 7.738e+02, threshold=6.811e+02, percent-clipped=1.0 2023-06-18 14:17:20,904 INFO [train.py:996] (0/4) Epoch 1, batch 28350, loss[loss=0.2776, simple_loss=0.3136, pruned_loss=0.1208, over 21850.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3477, pruned_loss=0.1175, over 4254092.41 frames. 
], batch size: 107, lr: 2.28e-02, grad_scale: 32.0 2023-06-18 14:17:25,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=170100.0, ans=0.0 2023-06-18 14:18:39,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=170220.0, ans=0.125 2023-06-18 14:19:22,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=170340.0, ans=0.125 2023-06-18 14:19:32,445 INFO [train.py:996] (0/4) Epoch 1, batch 28400, loss[loss=0.3153, simple_loss=0.3393, pruned_loss=0.1457, over 21238.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3437, pruned_loss=0.1174, over 4248903.93 frames. ], batch size: 471, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:20:02,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=170400.0, ans=0.025 2023-06-18 14:20:03,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170400.0, ans=0.1 2023-06-18 14:20:05,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=170400.0, ans=0.125 2023-06-18 14:20:53,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=170520.0, ans=0.2 2023-06-18 14:20:56,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=170520.0, ans=0.125 2023-06-18 14:21:34,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.372e+02 4.044e+02 4.842e+02 8.605e+02, threshold=8.089e+02, percent-clipped=5.0 2023-06-18 14:22:19,301 INFO [train.py:996] (0/4) Epoch 1, batch 28450, loss[loss=0.2923, simple_loss=0.3378, pruned_loss=0.1234, over 21420.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3518, pruned_loss=0.1238, over 4253931.54 frames. ], batch size: 211, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:22:46,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=170700.0, ans=0.125 2023-06-18 14:22:46,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=170700.0, ans=0.125 2023-06-18 14:24:13,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=170880.0, ans=0.0 2023-06-18 14:24:35,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=170940.0, ans=0.0 2023-06-18 14:24:45,896 INFO [train.py:996] (0/4) Epoch 1, batch 28500, loss[loss=0.3277, simple_loss=0.3844, pruned_loss=0.1355, over 21327.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.3545, pruned_loss=0.1264, over 4267290.62 frames. 
], batch size: 159, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:25:13,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=171000.0, ans=0.0 2023-06-18 14:25:23,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=171060.0, ans=0.1 2023-06-18 14:25:25,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=171060.0, ans=0.125 2023-06-18 14:26:05,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=171120.0, ans=0.125 2023-06-18 14:26:13,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=171120.0, ans=0.2 2023-06-18 14:26:24,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=171180.0, ans=0.2 2023-06-18 14:26:35,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=171180.0, ans=0.2 2023-06-18 14:26:37,756 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.096e+02 3.441e+02 4.526e+02 8.440e+02, threshold=6.881e+02, percent-clipped=1.0 2023-06-18 14:27:38,029 INFO [train.py:996] (0/4) Epoch 1, batch 28550, loss[loss=0.3423, simple_loss=0.4297, pruned_loss=0.1275, over 20728.00 frames. ], tot_loss[loss=0.3122, simple_loss=0.3638, pruned_loss=0.1303, over 4270099.05 frames. ], batch size: 607, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:27:56,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=171300.0, ans=0.125 2023-06-18 14:27:59,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=171300.0, ans=15.0 2023-06-18 14:28:02,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=171360.0, ans=0.125 2023-06-18 14:28:46,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=171420.0, ans=0.0 2023-06-18 14:29:03,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-18 14:29:24,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=171480.0, ans=0.125 2023-06-18 14:30:00,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=171540.0, ans=0.5 2023-06-18 14:30:17,191 INFO [train.py:996] (0/4) Epoch 1, batch 28600, loss[loss=0.3546, simple_loss=0.3992, pruned_loss=0.155, over 21577.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3706, pruned_loss=0.1326, over 4273807.99 frames. ], batch size: 414, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 14:32:15,155 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.104e+02 3.641e+02 4.688e+02 7.003e+02, threshold=7.282e+02, percent-clipped=1.0 2023-06-18 14:32:57,303 INFO [train.py:996] (0/4) Epoch 1, batch 28650, loss[loss=0.2756, simple_loss=0.3166, pruned_loss=0.1172, over 21532.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3645, pruned_loss=0.1314, over 4265540.03 frames. 
], batch size: 263, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:33:14,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=171900.0, ans=0.125 2023-06-18 14:34:20,853 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-18 14:34:48,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=172140.0, ans=0.2 2023-06-18 14:35:14,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0 2023-06-18 14:35:28,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=172140.0, ans=0.125 2023-06-18 14:35:35,833 INFO [train.py:996] (0/4) Epoch 1, batch 28700, loss[loss=0.3078, simple_loss=0.365, pruned_loss=0.1253, over 21826.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3641, pruned_loss=0.1332, over 4269023.23 frames. ], batch size: 282, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:35:36,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-18 14:36:30,337 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:36:47,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172320.0, ans=0.0 2023-06-18 14:36:47,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=172320.0, ans=0.125 2023-06-18 14:37:34,073 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.041e+02 3.554e+02 4.437e+02 9.213e+02, threshold=7.107e+02, percent-clipped=2.0 2023-06-18 14:38:08,595 INFO [train.py:996] (0/4) Epoch 1, batch 28750, loss[loss=0.2929, simple_loss=0.3762, pruned_loss=0.1048, over 20987.00 frames. ], tot_loss[loss=0.315, simple_loss=0.3639, pruned_loss=0.133, over 4276317.21 frames. ], batch size: 607, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:38:43,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=172560.0, ans=0.125 2023-06-18 14:39:27,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=172620.0, ans=0.125 2023-06-18 14:39:27,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=172620.0, ans=0.1 2023-06-18 14:39:29,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172620.0, ans=0.0 2023-06-18 14:40:46,664 INFO [train.py:996] (0/4) Epoch 1, batch 28800, loss[loss=0.3305, simple_loss=0.3818, pruned_loss=0.1396, over 21754.00 frames. ], tot_loss[loss=0.3183, simple_loss=0.3684, pruned_loss=0.1341, over 4280749.15 frames. 
], batch size: 298, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:40:53,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172800.0, ans=0.0 2023-06-18 14:41:11,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172860.0, ans=0.1 2023-06-18 14:42:26,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-18 14:42:35,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=172980.0, ans=0.2 2023-06-18 14:42:39,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=12.0 2023-06-18 14:42:41,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.872e+02 3.537e+02 4.377e+02 5.475e+02 1.000e+03, threshold=8.754e+02, percent-clipped=7.0 2023-06-18 14:43:28,261 INFO [train.py:996] (0/4) Epoch 1, batch 28850, loss[loss=0.3308, simple_loss=0.401, pruned_loss=0.1303, over 19859.00 frames. ], tot_loss[loss=0.3215, simple_loss=0.3702, pruned_loss=0.1364, over 4283847.35 frames. ], batch size: 702, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:44:16,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=173160.0, ans=0.125 2023-06-18 14:45:12,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=173280.0, ans=0.125 2023-06-18 14:45:36,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-18 14:46:14,987 INFO [train.py:996] (0/4) Epoch 1, batch 28900, loss[loss=0.3537, simple_loss=0.3979, pruned_loss=0.1547, over 21872.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3729, pruned_loss=0.1382, over 4283147.42 frames. ], batch size: 371, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 14:46:23,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=22.5 2023-06-18 14:47:01,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=173460.0, ans=0.0 2023-06-18 14:47:49,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=173520.0, ans=0.0 2023-06-18 14:48:15,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=173580.0, ans=0.125 2023-06-18 14:48:16,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.311e+02 3.872e+02 4.627e+02 1.015e+03, threshold=7.745e+02, percent-clipped=2.0 2023-06-18 14:49:06,885 INFO [train.py:996] (0/4) Epoch 1, batch 28950, loss[loss=0.4339, simple_loss=0.5125, pruned_loss=0.1777, over 19749.00 frames. ], tot_loss[loss=0.3228, simple_loss=0.373, pruned_loss=0.1363, over 4277372.22 frames. ], batch size: 702, lr: 2.25e-02, grad_scale: 64.0 2023-06-18 14:49:10,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. 
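limit=15.0

Each `train.py:996` entry reports two losses: `loss[...]` for the current batch and `tot_loss[...]`, a frame-weighted running average over recent batches. The fractional cumulative frame counts (e.g. `over 4280749.15 frames` just above) suggest an exponentially decayed average rather than a plain sum; the decay factor in the sketch below is an assumption.

```python
# Hedged sketch of the tot_loss bookkeeping implied by the train.py:996 lines:
# a frame-weighted running average with exponential forgetting. The 0.999
# decay is a guess motivated by the fractional frame totals in the log.


class RunningLoss:
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.weighted_loss = 0.0  # decayed sum of loss * frames
        self.frames = 0.0         # decayed sum of frames

    def update(self, loss: float, num_frames: float) -> float:
        self.weighted_loss = self.decay * self.weighted_loss + loss * num_frames
        self.frames = self.decay * self.frames + num_frames
        return self.weighted_loss / self.frames  # the reported tot_loss


tot = RunningLoss()
for batch_loss, frames in [(0.3265, 21359.0), (0.3306, 20605.0), (0.3839, 21731.0)]:
    print(f"tot_loss={tot.update(batch_loss, frames):.4f} "
          f"over {tot.frames:.2f} frames")
```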
2023-06-18 14:50:30,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=173820.0, ans=0.125 2023-06-18 14:50:30,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=173820.0, ans=0.125 2023-06-18 14:50:34,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=173820.0, ans=0.2 2023-06-18 14:50:34,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=173820.0, ans=0.02 2023-06-18 14:50:35,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=173820.0, ans=0.125 2023-06-18 14:50:39,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=173820.0, ans=0.025 2023-06-18 14:50:39,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=173820.0, ans=0.025 2023-06-18 14:51:09,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=173880.0, ans=0.125 2023-06-18 14:51:10,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.32 vs. limit=10.0 2023-06-18 14:51:52,111 INFO [train.py:996] (0/4) Epoch 1, batch 29000, loss[loss=0.3268, simple_loss=0.3806, pruned_loss=0.1365, over 21327.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3769, pruned_loss=0.1353, over 4276151.94 frames. ], batch size: 159, lr: 2.25e-02, grad_scale: 64.0 2023-06-18 14:52:36,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=174060.0, ans=0.05 2023-06-18 14:52:44,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=174060.0, ans=0.125 2023-06-18 14:52:45,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=174060.0, ans=0.2 2023-06-18 14:53:31,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=174180.0, ans=0.0 2023-06-18 14:53:52,716 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 3.056e+02 3.518e+02 4.354e+02 7.767e+02, threshold=7.035e+02, percent-clipped=1.0 2023-06-18 14:54:47,782 INFO [train.py:996] (0/4) Epoch 1, batch 29050, loss[loss=0.2814, simple_loss=0.3279, pruned_loss=0.1175, over 21821.00 frames. ], tot_loss[loss=0.326, simple_loss=0.3768, pruned_loss=0.1377, over 4281373.43 frames. ], batch size: 247, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 14:56:55,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=174540.0, ans=0.125 2023-06-18 14:57:12,114 INFO [train.py:996] (0/4) Epoch 1, batch 29100, loss[loss=0.2749, simple_loss=0.3238, pruned_loss=0.113, over 21810.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3671, pruned_loss=0.1343, over 4274093.05 frames.
], batch size: 98, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 14:58:44,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=174780.0, ans=0.125 2023-06-18 14:59:07,693 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 3.175e+02 3.774e+02 4.756e+02 7.058e+02, threshold=7.548e+02, percent-clipped=1.0 2023-06-18 14:59:25,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=174840.0, ans=0.125 2023-06-18 14:59:27,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-18 14:59:38,223 INFO [train.py:996] (0/4) Epoch 1, batch 29150, loss[loss=0.3321, simple_loss=0.3897, pruned_loss=0.1372, over 21529.00 frames. ], tot_loss[loss=0.3151, simple_loss=0.3664, pruned_loss=0.1319, over 4261433.61 frames. ], batch size: 389, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 14:59:39,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=22.5 2023-06-18 15:00:11,519 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-18 15:00:34,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=174960.0, ans=0.0 2023-06-18 15:01:09,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=175080.0, ans=0.125 2023-06-18 15:01:14,723 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.48 vs. limit=22.5 2023-06-18 15:01:16,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=175080.0, ans=0.2 2023-06-18 15:01:52,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=175140.0, ans=0.125 2023-06-18 15:02:04,916 INFO [train.py:996] (0/4) Epoch 1, batch 29200, loss[loss=0.2716, simple_loss=0.3157, pruned_loss=0.1137, over 21379.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3595, pruned_loss=0.1293, over 4258123.86 frames. ], batch size: 131, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:02:18,847 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=22.5 2023-06-18 15:03:56,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175380.0, ans=0.1 2023-06-18 15:04:00,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.999e+02 3.311e+02 3.772e+02 7.580e+02, threshold=6.622e+02, percent-clipped=1.0 2023-06-18 15:04:39,751 INFO [train.py:996] (0/4) Epoch 1, batch 29250, loss[loss=0.3182, simple_loss=0.3879, pruned_loss=0.1242, over 21762.00 frames. ], tot_loss[loss=0.3029, simple_loss=0.3558, pruned_loss=0.125, over 4263723.78 frames. 
], batch size: 352, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:04:42,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=175500.0, ans=0.125 2023-06-18 15:05:47,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=175620.0, ans=0.125 2023-06-18 15:07:21,914 INFO [train.py:996] (0/4) Epoch 1, batch 29300, loss[loss=0.2878, simple_loss=0.3492, pruned_loss=0.1132, over 21324.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3564, pruned_loss=0.1233, over 4265439.43 frames. ], batch size: 176, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:07:22,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=175800.0, ans=0.0 2023-06-18 15:07:42,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=175860.0, ans=0.0 2023-06-18 15:07:42,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=175860.0, ans=0.125 2023-06-18 15:08:34,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=175920.0, ans=0.125 2023-06-18 15:08:38,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=175980.0, ans=0.07 2023-06-18 15:09:16,988 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.106e+02 3.713e+02 4.603e+02 7.072e+02, threshold=7.425e+02, percent-clipped=2.0 2023-06-18 15:09:27,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=176040.0, ans=0.0 2023-06-18 15:09:49,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=176100.0, ans=0.125 2023-06-18 15:09:50,098 INFO [train.py:996] (0/4) Epoch 1, batch 29350, loss[loss=0.2724, simple_loss=0.3448, pruned_loss=0.09999, over 21618.00 frames. ], tot_loss[loss=0.2982, simple_loss=0.3524, pruned_loss=0.122, over 4262976.29 frames. ], batch size: 263, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:10:03,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=176160.0, ans=0.125 2023-06-18 15:10:14,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=176160.0, ans=0.0 2023-06-18 15:10:33,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=176160.0, ans=0.125 2023-06-18 15:10:41,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=176220.0, ans=0.035 2023-06-18 15:11:22,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=176280.0, ans=0.0 2023-06-18 15:12:23,754 INFO [train.py:996] (0/4) Epoch 1, batch 29400, loss[loss=0.3545, simple_loss=0.3949, pruned_loss=0.157, over 21516.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3489, pruned_loss=0.1184, over 4251717.95 frames. ], batch size: 508, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:13:52,041 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. 
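limit=15.0

Earlier in the log a batch-numbered checkpoint was written (`zipformer/exp_L_small/checkpoint-28000.pt`), and an `epoch-1.pt` follows at the epoch boundary, so checkpoints are saved on two levels: every fixed number of batches and once per epoch. A sketch of that cadence follows; the 4000-batch interval and the payload fields are assumptions, not icefall's actual checkpoint.py.

```python
# Hedged sketch of the two-level checkpoint cadence visible in the log.
# Interval and saved fields are assumptions.
from pathlib import Path

import torch


def maybe_save_batch_checkpoint(model, optimizer, batch_idx: int,
                                exp_dir: Path, every_n: int = 4000) -> None:
    """Write checkpoint-<batch>.pt every `every_n` batches (e.g. batch 28000)."""
    if batch_idx > 0 and batch_idx % every_n == 0:
        torch.save(
            {
                "model": model.state_dict(),          # assumed payload fields
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx,
            },
            exp_dir / f"checkpoint-{batch_idx}.pt",
        )


def save_epoch_checkpoint(model, epoch: int, exp_dir: Path) -> None:
    """Write epoch-<n>.pt at each epoch boundary (e.g. epoch-1.pt below)."""
    torch.save({"model": model.state_dict(), "epoch": epoch},
               exp_dir / f"epoch-{epoch}.pt")
```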
2023-06-18 15:14:09,464 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.984e+02 3.574e+02 4.623e+02 7.193e+02, threshold=7.148e+02, percent-clipped=0.0 2023-06-18 15:15:00,993 INFO [train.py:996] (0/4) Epoch 1, batch 29450, loss[loss=0.3249, simple_loss=0.3711, pruned_loss=0.1394, over 21800.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3483, pruned_loss=0.1182, over 4252947.89 frames. ], batch size: 247, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 15:15:22,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=176700.0, ans=0.0 2023-06-18 15:16:34,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=176880.0, ans=0.035 2023-06-18 15:17:07,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-18 15:17:09,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=176940.0, ans=0.0 2023-06-18 15:17:41,660 INFO [train.py:996] (0/4) Epoch 1, batch 29500, loss[loss=0.2871, simple_loss=0.3376, pruned_loss=0.1183, over 21487.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3558, pruned_loss=0.124, over 4252470.48 frames. ], batch size: 211, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:19:17,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=177180.0, ans=0.125 2023-06-18 15:19:28,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 3.232e+02 3.818e+02 4.731e+02 1.137e+03, threshold=7.636e+02, percent-clipped=8.0 2023-06-18 15:19:30,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=177240.0, ans=0.125 2023-06-18 15:20:13,129 INFO [train.py:996] (0/4) Epoch 1, batch 29550, loss[loss=0.3235, simple_loss=0.3666, pruned_loss=0.1401, over 21895.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3547, pruned_loss=0.1261, over 4264131.89 frames. ], batch size: 414, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:20:30,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.51 vs. limit=15.0 2023-06-18 15:22:47,025 INFO [train.py:996] (0/4) Epoch 1, batch 29600, loss[loss=0.3019, simple_loss=0.3721, pruned_loss=0.1158, over 21739.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3609, pruned_loss=0.1291, over 4273586.48 frames. ], batch size: 247, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:23:06,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=177600.0, ans=0.125 2023-06-18 15:23:34,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=177660.0, ans=0.125 2023-06-18 15:23:40,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177660.0, ans=0.1 2023-06-18 15:24:06,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.44 vs.
limit=22.5 2023-06-18 15:24:35,663 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0 2023-06-18 15:24:54,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.992e+02 3.426e+02 4.261e+02 6.477e+02, threshold=6.851e+02, percent-clipped=0.0 2023-06-18 15:24:59,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=177840.0, ans=0.125 2023-06-18 15:25:26,151 INFO [train.py:996] (0/4) Epoch 1, batch 29650, loss[loss=0.3295, simple_loss=0.374, pruned_loss=0.1425, over 21801.00 frames. ], tot_loss[loss=0.3051, simple_loss=0.3601, pruned_loss=0.1251, over 4275696.23 frames. ], batch size: 107, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:26:56,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=178020.0, ans=0.0 2023-06-18 15:27:05,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-18 15:28:08,856 INFO [train.py:996] (0/4) Epoch 1, batch 29700, loss[loss=0.3555, simple_loss=0.4498, pruned_loss=0.1306, over 21265.00 frames. ], tot_loss[loss=0.3073, simple_loss=0.3625, pruned_loss=0.126, over 4278078.62 frames. ], batch size: 548, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 15:28:21,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=178200.0, ans=0.125 2023-06-18 15:28:25,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=178200.0, ans=0.1 2023-06-18 15:28:26,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=178200.0, ans=0.125 2023-06-18 15:28:54,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=178260.0, ans=0.125 2023-06-18 15:29:20,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=178320.0, ans=0.125 2023-06-18 15:30:14,477 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 3.216e+02 3.981e+02 5.488e+02 8.524e+02, threshold=7.962e+02, percent-clipped=7.0 2023-06-18 15:30:16,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=178440.0, ans=0.0 2023-06-18 15:30:20,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=178440.0, ans=0.125 2023-06-18 15:30:27,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=178440.0, ans=0.125 2023-06-18 15:30:31,206 INFO [train.py:996] (0/4) Epoch 1, batch 29750, loss[loss=0.3109, simple_loss=0.3924, pruned_loss=0.1147, over 21850.00 frames. ], tot_loss[loss=0.3072, simple_loss=0.3661, pruned_loss=0.1242, over 4277390.50 frames. 
], batch size: 316, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:31:08,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=178500.0, ans=10.0 2023-06-18 15:31:48,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-18 15:33:17,750 INFO [train.py:996] (0/4) Epoch 1, batch 29800, loss[loss=0.2905, simple_loss=0.3466, pruned_loss=0.1172, over 21797.00 frames. ], tot_loss[loss=0.3087, simple_loss=0.3677, pruned_loss=0.1249, over 4276742.03 frames. ], batch size: 282, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:34:19,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=178920.0, ans=0.125 2023-06-18 15:34:47,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=178920.0, ans=0.125 2023-06-18 15:35:23,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 3.012e+02 3.384e+02 3.968e+02 6.046e+02, threshold=6.767e+02, percent-clipped=0.0 2023-06-18 15:35:30,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=179040.0, ans=0.125 2023-06-18 15:35:44,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=179040.0, ans=0.125 2023-06-18 15:35:46,771 INFO [train.py:996] (0/4) Epoch 1, batch 29850, loss[loss=0.2845, simple_loss=0.3434, pruned_loss=0.1129, over 21702.00 frames. ], tot_loss[loss=0.3043, simple_loss=0.3639, pruned_loss=0.1224, over 4276405.49 frames. ], batch size: 389, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:36:42,642 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:37:11,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.18 vs. limit=22.5 2023-06-18 15:37:59,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=179340.0, ans=0.125 2023-06-18 15:38:20,889 INFO [train.py:996] (0/4) Epoch 1, batch 29900, loss[loss=0.3078, simple_loss=0.3495, pruned_loss=0.133, over 21825.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3613, pruned_loss=0.1239, over 4279044.78 frames. ], batch size: 298, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:38:30,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=179400.0, ans=0.125 2023-06-18 15:39:32,510 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.57 vs. 
limit=12.0 2023-06-18 15:40:02,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=179580.0, ans=0.0 2023-06-18 15:40:07,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=179580.0, ans=10.0 2023-06-18 15:40:12,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.008e+02 3.612e+02 4.366e+02 7.864e+02, threshold=7.225e+02, percent-clipped=1.0 2023-06-18 15:40:17,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=179640.0, ans=0.0 2023-06-18 15:40:49,209 INFO [train.py:996] (0/4) Epoch 1, batch 29950, loss[loss=0.3477, simple_loss=0.3901, pruned_loss=0.1526, over 21482.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3662, pruned_loss=0.13, over 4277778.91 frames. ], batch size: 194, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:42:38,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=179880.0, ans=0.125 2023-06-18 15:43:22,198 INFO [train.py:996] (0/4) Epoch 1, batch 30000, loss[loss=0.2621, simple_loss=0.3398, pruned_loss=0.09219, over 21612.00 frames. ], tot_loss[loss=0.3152, simple_loss=0.3693, pruned_loss=0.1305, over 4275472.91 frames. ], batch size: 230, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 15:43:22,200 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 15:44:17,106 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.2715, simple_loss=0.3724, pruned_loss=0.08526, over 1796401.00 frames. 2023-06-18 15:44:17,106 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-18 15:44:26,736 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-18 15:44:32,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-18 15:44:44,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.09 vs. limit=10.0 2023-06-18 15:45:12,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=180120.0, ans=0.125 2023-06-18 15:46:21,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.874e+02 3.529e+02 4.809e+02 8.528e+02, threshold=7.059e+02, percent-clipped=3.0 2023-06-18 15:46:49,221 INFO [train.py:996] (0/4) Epoch 1, batch 30050, loss[loss=0.3148, simple_loss=0.4055, pruned_loss=0.112, over 21754.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3713, pruned_loss=0.1254, over 4274794.85 frames. 
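], batch size: 351, lr: 2.21e-02, grad_scale: 32.0

Throughout the log the reported losses satisfy loss = 0.5 × simple_loss + pruned_loss: at batch 30000 above, 0.5 × 0.3693 + 0.1305 = 0.3152, and for the validation pass 0.5 × 0.3724 + 0.08526 ≈ 0.2715. The smoothed "simple" transducer loss is therefore down-weighted against the pruned full-lattice loss. A toy check of that bookkeeping follows; the 0.5 scale is inferred from the logged numbers, not read from the training script.

```python
# Check that the logged loss decomposition is a weighted sum of the two
# pruned-RNN-T terms. The 0.5 scale is inferred from the log values above.
def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
    """Combine the smoothed 'simple' loss with the pruned full-lattice loss."""
    return simple_loss_scale * simple_loss + pruned_loss


# tot_loss values from batches 29950 and 30000 above; both reproduce the
# logged combined loss to the printed precision.
for batch, simple, pruned in [(29950, 0.3662, 0.13), (30000, 0.3693, 0.1305)]:
    print(batch, f"{combine_losses(simple, pruned):.4f}")
```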
2023-06-18 15:47:23,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=180300.0, ans=0.0 2023-06-18 15:47:57,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=180420.0, ans=0.04949747468305833 2023-06-18 15:48:21,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=180480.0, ans=0.125 2023-06-18 15:48:55,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=180540.0, ans=0.125 2023-06-18 15:49:30,835 INFO [train.py:996] (0/4) Epoch 1, batch 30100, loss[loss=0.2626, simple_loss=0.3131, pruned_loss=0.106, over 21720.00 frames. ], tot_loss[loss=0.3099, simple_loss=0.3695, pruned_loss=0.1251, over 4261745.15 frames. ], batch size: 282, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:50:54,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=180780.0, ans=0.0 2023-06-18 15:51:12,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 3.465e+02 4.064e+02 4.791e+02 7.822e+02, threshold=8.128e+02, percent-clipped=3.0 2023-06-18 15:51:37,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=180840.0, ans=0.125 2023-06-18 15:51:54,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-18 15:51:59,482 INFO [train.py:996] (0/4) Epoch 1, batch 30150, loss[loss=0.3281, simple_loss=0.3762, pruned_loss=0.14, over 21282.00 frames. ], tot_loss[loss=0.3113, simple_loss=0.3665, pruned_loss=0.128, over 4268944.90 frames. ], batch size: 159, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:52:25,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=180960.0, ans=0.125 2023-06-18 15:53:16,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=15.0 2023-06-18 15:53:17,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=181020.0, ans=10.0 2023-06-18 15:53:27,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-18 15:54:19,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.62 vs. limit=15.0 2023-06-18 15:54:32,762 INFO [train.py:996] (0/4) Epoch 1, batch 30200, loss[loss=0.3873, simple_loss=0.4504, pruned_loss=0.1621, over 21422.00 frames. ], tot_loss[loss=0.3129, simple_loss=0.3699, pruned_loss=0.1279, over 4267897.44 frames. ], batch size: 507, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:54:33,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.70 vs.
limit=15.0 2023-06-18 15:55:04,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=181200.0, ans=0.125 2023-06-18 15:55:07,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=181200.0, ans=0.125 2023-06-18 15:56:44,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 2.883e+02 3.649e+02 4.824e+02 8.627e+02, threshold=7.297e+02, percent-clipped=2.0 2023-06-18 15:57:37,571 INFO [train.py:996] (0/4) Epoch 1, batch 30250, loss[loss=0.316, simple_loss=0.3997, pruned_loss=0.1162, over 21423.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3799, pruned_loss=0.1317, over 4268791.24 frames. ], batch size: 131, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 15:58:09,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=181560.0, ans=0.125 2023-06-18 15:58:52,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=181620.0, ans=0.125 2023-06-18 15:59:37,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=181740.0, ans=0.2 2023-06-18 16:00:04,191 INFO [train.py:996] (0/4) Epoch 1, batch 30300, loss[loss=0.2707, simple_loss=0.3224, pruned_loss=0.1095, over 21608.00 frames. ], tot_loss[loss=0.3186, simple_loss=0.375, pruned_loss=0.1311, over 4268807.39 frames. ], batch size: 298, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 16:00:29,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=181800.0, ans=0.0 2023-06-18 16:02:11,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.225e+02 4.059e+02 5.033e+02 9.732e+02, threshold=8.118e+02, percent-clipped=4.0 2023-06-18 16:02:49,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=182040.0, ans=0.125 2023-06-18 16:02:51,736 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-18 16:02:58,932 INFO [train.py:996] (0/4) Epoch 1, batch 30350, loss[loss=0.2577, simple_loss=0.3075, pruned_loss=0.1039, over 21275.00 frames. ], tot_loss[loss=0.3207, simple_loss=0.3767, pruned_loss=0.1323, over 4271378.19 frames. ], batch size: 176, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 16:03:01,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2023-06-18 16:03:32,279 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.0 2023-06-18 16:04:38,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=182280.0, ans=0.025 2023-06-18 16:04:44,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=182340.0, ans=0.1 2023-06-18 16:05:57,808 INFO [train.py:996] (0/4) Epoch 1, batch 30400, loss[loss=0.3105, simple_loss=0.3311, pruned_loss=0.145, over 20174.00 frames. 
], tot_loss[loss=0.312, simple_loss=0.3673, pruned_loss=0.1283, over 4264263.79 frames. ], batch size: 702, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 16:07:10,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=182460.0, ans=0.125 2023-06-18 16:07:49,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=182460.0, ans=0.2 2023-06-18 16:08:48,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=182520.0, ans=0.05 2023-06-18 16:09:40,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=182580.0, ans=0.0 2023-06-18 16:09:50,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.733e+02 4.742e+02 5.817e+02 2.279e+03, threshold=9.485e+02, percent-clipped=8.0 2023-06-18 16:11:10,019 INFO [train.py:996] (0/4) Epoch 1, batch 30450, loss[loss=0.379, simple_loss=0.4578, pruned_loss=0.1501, over 19950.00 frames. ], tot_loss[loss=0.3156, simple_loss=0.3698, pruned_loss=0.1307, over 4204834.47 frames. ], batch size: 702, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 16:11:16,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=182700.0, ans=0.0 2023-06-18 16:12:57,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=182760.0, ans=0.0 2023-06-18 16:12:59,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=182820.0, ans=0.125 2023-06-18 16:13:09,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0 2023-06-18 16:13:50,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=12.0 2023-06-18 16:15:07,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=182940.0, ans=0.0 2023-06-18 16:15:13,724 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-1.pt 2023-06-18 16:17:54,930 INFO [train.py:996] (0/4) Epoch 2, batch 0, loss[loss=0.4236, simple_loss=0.4088, pruned_loss=0.2192, over 21351.00 frames. ], tot_loss[loss=0.4236, simple_loss=0.4088, pruned_loss=0.2192, over 21351.00 frames. ], batch size: 473, lr: 2.01e-02, grad_scale: 32.0 2023-06-18 16:17:54,931 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 16:18:35,907 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.1918, 2.3904, 4.0255, 2.0818], device='cuda:0') 2023-06-18 16:18:53,127 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2985, simple_loss=0.394, pruned_loss=0.1016, over 1796401.00 frames. 
2023-06-18 16:18:53,127 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-18 16:19:28,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=183090.0, ans=0.0 2023-06-18 16:19:48,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=183150.0, ans=0.125 2023-06-18 16:20:34,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 3.662e+02 5.243e+02 7.950e+02 2.244e+03, threshold=1.049e+03, percent-clipped=17.0 2023-06-18 16:20:34,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=183210.0, ans=0.125 2023-06-18 16:20:41,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=183210.0, ans=0.125 2023-06-18 16:20:43,983 INFO [train.py:996] (0/4) Epoch 2, batch 50, loss[loss=0.2785, simple_loss=0.3495, pruned_loss=0.1038, over 21695.00 frames. ], tot_loss[loss=0.3145, simple_loss=0.3712, pruned_loss=0.1289, over 962889.02 frames. ], batch size: 332, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:20:52,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.97 vs. limit=15.0 2023-06-18 16:22:07,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183390.0, ans=0.1 2023-06-18 16:22:10,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.66 vs. limit=10.0 2023-06-18 16:22:18,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=183390.0, ans=0.0 2023-06-18 16:22:53,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=183510.0, ans=0.0 2023-06-18 16:22:57,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=183510.0, ans=0.125 2023-06-18 16:23:20,714 INFO [train.py:996] (0/4) Epoch 2, batch 100, loss[loss=0.3235, simple_loss=0.4004, pruned_loss=0.1233, over 21840.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3845, pruned_loss=0.1293, over 1688939.06 frames. ], batch size: 316, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:23:32,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=183570.0, ans=0.125 2023-06-18 16:24:16,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=15.0 2023-06-18 16:25:03,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=183750.0, ans=0.125 2023-06-18 16:25:18,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=183810.0, ans=0.0 2023-06-18 16:25:24,688 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.827e+02 3.490e+02 4.158e+02 9.308e+02, threshold=6.980e+02, percent-clipped=0.0 2023-06-18 16:25:25,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-18 16:25:29,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=183810.0, ans=0.125 2023-06-18 16:25:41,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=183810.0, ans=0.125 2023-06-18 16:25:41,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=183810.0, ans=0.125 2023-06-18 16:25:44,221 INFO [train.py:996] (0/4) Epoch 2, batch 150, loss[loss=0.3134, simple_loss=0.3899, pruned_loss=0.1185, over 21745.00 frames. ], tot_loss[loss=0.3284, simple_loss=0.3915, pruned_loss=0.1326, over 2263587.57 frames. ], batch size: 351, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:25:44,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=183870.0, ans=0.125 2023-06-18 16:26:09,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=183870.0, ans=0.025 2023-06-18 16:27:04,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183990.0, ans=0.1 2023-06-18 16:27:11,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=184050.0, ans=0.0 2023-06-18 16:27:26,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=184050.0, ans=0.2 2023-06-18 16:28:08,794 INFO [train.py:996] (0/4) Epoch 2, batch 200, loss[loss=0.3052, simple_loss=0.361, pruned_loss=0.1248, over 21162.00 frames. ], tot_loss[loss=0.3208, simple_loss=0.3847, pruned_loss=0.1285, over 2711956.84 frames. ], batch size: 143, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:28:26,117 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.68 vs. 
limit=15.0 2023-06-18 16:28:47,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=184230.0, ans=0.035 2023-06-18 16:29:49,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184350.0, ans=0.1 2023-06-18 16:30:03,482 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.883e+02 3.671e+02 4.524e+02 7.455e+02, threshold=7.342e+02, percent-clipped=3.0 2023-06-18 16:30:27,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=184410.0, ans=0.125 2023-06-18 16:30:31,430 INFO [train.py:996] (0/4) Epoch 2, batch 250, loss[loss=0.2837, simple_loss=0.3443, pruned_loss=0.1115, over 21258.00 frames. ], tot_loss[loss=0.3161, simple_loss=0.3788, pruned_loss=0.1267, over 3054691.07 frames. ], batch size: 159, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:30:33,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=184470.0, ans=0.125 2023-06-18 16:30:36,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=184470.0, ans=0.0 2023-06-18 16:32:40,646 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.97 vs. limit=15.0 2023-06-18 16:33:00,000 INFO [train.py:996] (0/4) Epoch 2, batch 300, loss[loss=0.3121, simple_loss=0.3517, pruned_loss=0.1362, over 21869.00 frames. ], tot_loss[loss=0.3116, simple_loss=0.3731, pruned_loss=0.1251, over 3332174.56 frames. ], batch size: 373, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:35:08,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=184950.0, ans=0.125 2023-06-18 16:35:09,295 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.96 vs. limit=12.0 2023-06-18 16:35:16,558 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.885e+02 3.453e+02 4.241e+02 7.673e+02, threshold=6.906e+02, percent-clipped=1.0 2023-06-18 16:35:32,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=185010.0, ans=0.125 2023-06-18 16:35:39,485 INFO [train.py:996] (0/4) Epoch 2, batch 350, loss[loss=0.2647, simple_loss=0.3295, pruned_loss=0.0999, over 21207.00 frames. ], tot_loss[loss=0.3057, simple_loss=0.3648, pruned_loss=0.1233, over 3532574.88 frames. ], batch size: 159, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 16:37:06,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=185190.0, ans=0.125 2023-06-18 16:37:19,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=185250.0, ans=0.1 2023-06-18 16:38:01,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.78 vs. 
limit=15.0 2023-06-18 16:38:03,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=185310.0, ans=0.0 2023-06-18 16:38:05,896 INFO [train.py:996] (0/4) Epoch 2, batch 400, loss[loss=0.2585, simple_loss=0.3036, pruned_loss=0.1067, over 21625.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.358, pruned_loss=0.1217, over 3694133.34 frames. ], batch size: 247, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:38:44,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=185370.0, ans=0.2 2023-06-18 16:39:00,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=185430.0, ans=0.125 2023-06-18 16:39:52,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=185550.0, ans=0.05 2023-06-18 16:40:25,134 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.835e+02 3.626e+02 4.923e+02 9.458e+02, threshold=7.251e+02, percent-clipped=2.0 2023-06-18 16:40:42,455 INFO [train.py:996] (0/4) Epoch 2, batch 450, loss[loss=0.2416, simple_loss=0.3006, pruned_loss=0.09134, over 21267.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3513, pruned_loss=0.1183, over 3827268.99 frames. ], batch size: 176, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:41:13,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0 2023-06-18 16:41:26,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=185730.0, ans=0.125 2023-06-18 16:42:42,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=185910.0, ans=0.125 2023-06-18 16:43:08,792 INFO [train.py:996] (0/4) Epoch 2, batch 500, loss[loss=0.2601, simple_loss=0.31, pruned_loss=0.1051, over 21539.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3544, pruned_loss=0.1162, over 3929509.71 frames. ], batch size: 263, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 16:44:37,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-18 16:45:04,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=186150.0, ans=0.0 2023-06-18 16:45:18,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.995e+02 3.727e+02 5.025e+02 8.561e+02, threshold=7.454e+02, percent-clipped=5.0 2023-06-18 16:45:43,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=22.5 2023-06-18 16:45:49,758 INFO [train.py:996] (0/4) Epoch 2, batch 550, loss[loss=0.2393, simple_loss=0.31, pruned_loss=0.08426, over 21602.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3556, pruned_loss=0.1152, over 4007972.38 frames. 
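], batch size: 263, lr: 1.99e-02, grad_scale: 32.0

The three figures inside each loss[...] and tot_loss[...] bracket are tied together: the reported loss is the pruned-transducer objective, a weighted sum of the simple (trivial-joiner) loss and the pruned RNN-T loss. With this run's simple_loss_scale of 0.5, the tot_loss entry just above checks out exactly: 0.5 * 0.3556 + 0.1152 = 0.293. A sketch of the post-warm-up combination follows; icefall ramps the two scales during warm-up, which is omitted here, and the function name is illustrative.

```python
# Sketch of how the logged "loss" decomposes into its two components once
# warm-up is over (simple loss scaled by 0.5, pruned loss by 1.0).
def total_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    return simple_loss_scale * simple_loss + pruned_loss

print(total_loss(0.3556, 0.1152))   # 0.293, the tot_loss entry above
print(total_loss(0.3724, 0.08526))  # ~0.2715, the epoch-1 validation loss
```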
2023-06-18 16:45:50,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=186270.0, ans=0.1
2023-06-18 16:45:53,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.92 vs. limit=15.0
2023-06-18 16:46:10,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0
2023-06-18 16:46:51,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=186390.0, ans=0.125
2023-06-18 16:47:29,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=186450.0, ans=0.125
2023-06-18 16:48:05,027 INFO [train.py:996] (0/4) Epoch 2, batch 600, loss[loss=0.3335, simple_loss=0.3918, pruned_loss=0.1376, over 21906.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3557, pruned_loss=0.1149, over 4064789.56 frames. ], batch size: 316, lr: 1.99e-02, grad_scale: 64.0
2023-06-18 16:50:14,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=186810.0, ans=0.125
2023-06-18 16:50:17,218 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.275e+02 3.817e+02 5.405e+02 1.141e+03, threshold=7.634e+02, percent-clipped=7.0
2023-06-18 16:50:26,073 INFO [train.py:996] (0/4) Epoch 2, batch 650, loss[loss=0.3095, simple_loss=0.3538, pruned_loss=0.1326, over 14842.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3566, pruned_loss=0.1156, over 4104542.14 frames. ], batch size: 60, lr: 1.99e-02, grad_scale: 32.0
2023-06-18 16:50:56,747 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 16:51:05,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=186930.0, ans=0.0
2023-06-18 16:51:11,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=186930.0, ans=0.125
2023-06-18 16:51:18,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=186990.0, ans=0.125
2023-06-18 16:52:02,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=187050.0, ans=0.2
2023-06-18 16:52:04,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=187050.0, ans=0.0
2023-06-18 16:53:00,526 INFO [train.py:996] (0/4) Epoch 2, batch 700, loss[loss=0.283, simple_loss=0.3535, pruned_loss=0.1062, over 15798.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3575, pruned_loss=0.1163, over 4140571.42 frames.
], batch size: 60, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:53:34,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=187170.0, ans=0.0 2023-06-18 16:54:08,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=187290.0, ans=0.125 2023-06-18 16:54:39,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=187350.0, ans=0.125 2023-06-18 16:54:56,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=187350.0, ans=0.0 2023-06-18 16:55:08,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 3.018e+02 3.846e+02 4.638e+02 8.103e+02, threshold=7.692e+02, percent-clipped=2.0 2023-06-18 16:55:22,985 INFO [train.py:996] (0/4) Epoch 2, batch 750, loss[loss=0.2859, simple_loss=0.34, pruned_loss=0.1159, over 21926.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3566, pruned_loss=0.118, over 4182621.15 frames. ], batch size: 316, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:55:35,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=187470.0, ans=0.125 2023-06-18 16:57:49,854 INFO [train.py:996] (0/4) Epoch 2, batch 800, loss[loss=0.2982, simple_loss=0.344, pruned_loss=0.1262, over 21450.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3565, pruned_loss=0.1194, over 4207460.01 frames. ], batch size: 211, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:57:50,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=187770.0, ans=0.125 2023-06-18 16:57:50,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=187770.0, ans=0.0 2023-06-18 16:57:51,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=187770.0, ans=0.0 2023-06-18 16:57:57,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=187770.0, ans=0.125 2023-06-18 16:58:27,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=187830.0, ans=0.125 2023-06-18 16:58:27,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.81 vs. limit=15.0 2023-06-18 16:58:40,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=187890.0, ans=0.125 2023-06-18 16:59:24,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 3.112e+02 3.737e+02 4.664e+02 8.129e+02, threshold=7.474e+02, percent-clipped=2.0 2023-06-18 16:59:45,758 INFO [train.py:996] (0/4) Epoch 2, batch 850, loss[loss=0.3187, simple_loss=0.3646, pruned_loss=0.1364, over 21929.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3538, pruned_loss=0.119, over 4225132.38 frames. 
], batch size: 107, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 16:59:49,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=188070.0, ans=0.125 2023-06-18 16:59:59,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=188130.0, ans=0.125 2023-06-18 17:01:31,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=188310.0, ans=0.2 2023-06-18 17:01:48,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=188310.0, ans=0.2 2023-06-18 17:02:00,151 INFO [train.py:996] (0/4) Epoch 2, batch 900, loss[loss=0.3415, simple_loss=0.3736, pruned_loss=0.1547, over 21592.00 frames. ], tot_loss[loss=0.2947, simple_loss=0.352, pruned_loss=0.1188, over 4240289.87 frames. ], batch size: 471, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 17:02:37,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=188430.0, ans=0.1 2023-06-18 17:03:15,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=188550.0, ans=0.0 2023-06-18 17:03:15,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=188550.0, ans=0.125 2023-06-18 17:03:44,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.055e+02 2.912e+02 3.443e+02 3.950e+02 7.749e+02, threshold=6.886e+02, percent-clipped=2.0 2023-06-18 17:03:45,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=188610.0, ans=0.1 2023-06-18 17:04:02,593 INFO [train.py:996] (0/4) Epoch 2, batch 950, loss[loss=0.2745, simple_loss=0.3424, pruned_loss=0.1033, over 21864.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3479, pruned_loss=0.1171, over 4239167.21 frames. ], batch size: 351, lr: 1.98e-02, grad_scale: 16.0 2023-06-18 17:04:17,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=188670.0, ans=0.125 2023-06-18 17:06:11,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=188910.0, ans=0.125 2023-06-18 17:06:18,724 INFO [train.py:996] (0/4) Epoch 2, batch 1000, loss[loss=0.2499, simple_loss=0.3005, pruned_loss=0.09967, over 21585.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3475, pruned_loss=0.1169, over 4251648.35 frames. ], batch size: 247, lr: 1.98e-02, grad_scale: 16.0 2023-06-18 17:07:05,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=189090.0, ans=0.125 2023-06-18 17:08:08,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.98 vs. 
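limit=15.0

The optim.py:471 entries report the minimum, quartiles and maximum of recent gradient norms together with the active clipping threshold; with Clipping_scale=2.0 the threshold tracks twice the running median (first entry below: 7.515e+02 against 2.0 x 3.757e+02, up to print rounding), and percent-clipped is the share of recent batches whose gradients were scaled down. A standalone sketch of that scheme follows; icefall's ScaledAdam folds this into the optimizer update, so the helper below is illustrative only.

```python
import torch

# Sketch of median-based gradient clipping: track a window of recent global
# gradient norms, report their quartiles, clip to clipping_scale * median.
class QuartileClipper:
    def __init__(self, clipping_scale=2.0, window=128):
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms = []

    def clip_(self, params):
        params = [p for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
        self.norms = (self.norms + [norm.item()])[-self.window:]
        quartiles = torch.quantile(torch.tensor(self.norms),
                                   torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * quartiles[2].item()  # 2x the median
        if norm > threshold:           # scale every gradient down in place
            for p in params:
                p.grad.mul_(threshold / norm)
        return quartiles, threshold
```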
2023-06-18 17:08:20,459 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.046e+02 3.757e+02 4.775e+02 8.015e+02, threshold=7.515e+02, percent-clipped=4.0
2023-06-18 17:08:26,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=189270.0, ans=0.125
2023-06-18 17:08:27,772 INFO [train.py:996] (0/4) Epoch 2, batch 1050, loss[loss=0.2863, simple_loss=0.3523, pruned_loss=0.1101, over 21307.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3502, pruned_loss=0.1178, over 4262925.13 frames. ], batch size: 159, lr: 1.97e-02, grad_scale: 16.0
2023-06-18 17:08:59,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0
2023-06-18 17:09:03,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=189330.0, ans=0.125
2023-06-18 17:10:03,736 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.69 vs. limit=6.0
2023-06-18 17:10:17,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=189510.0, ans=0.0
2023-06-18 17:10:36,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=189510.0, ans=0.0
2023-06-18 17:10:42,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=189510.0, ans=0.0
2023-06-18 17:10:46,338 INFO [train.py:996] (0/4) Epoch 2, batch 1100, loss[loss=0.3145, simple_loss=0.3753, pruned_loss=0.1268, over 21614.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3503, pruned_loss=0.1174, over 4276520.47 frames. ], batch size: 441, lr: 1.97e-02, grad_scale: 16.0
2023-06-18 17:12:06,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0
2023-06-18 17:12:30,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.384e+02 3.965e+02 4.947e+02 9.506e+02, threshold=7.929e+02, percent-clipped=4.0
2023-06-18 17:12:48,323 INFO [train.py:996] (0/4) Epoch 2, batch 1150, loss[loss=0.3055, simple_loss=0.3754, pruned_loss=0.1179, over 21808.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3502, pruned_loss=0.1176, over 4274310.48 frames.
], batch size: 371, lr: 1.97e-02, grad_scale: 16.0 2023-06-18 17:13:07,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=189870.0, ans=0.0 2023-06-18 17:13:20,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=189930.0, ans=10.0 2023-06-18 17:13:40,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=189990.0, ans=0.0 2023-06-18 17:13:45,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=189990.0, ans=0.2 2023-06-18 17:13:53,926 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:14:20,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=190050.0, ans=0.125 2023-06-18 17:14:22,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-06-18 17:14:26,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-06-18 17:15:12,202 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=22.5 2023-06-18 17:15:15,718 INFO [train.py:996] (0/4) Epoch 2, batch 1200, loss[loss=0.2892, simple_loss=0.3661, pruned_loss=0.1062, over 21632.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3509, pruned_loss=0.1174, over 4280023.56 frames. ], batch size: 389, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:15:18,138 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-18 17:15:51,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.36 vs. limit=10.0 2023-06-18 17:17:04,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.111e+02 4.019e+02 4.816e+02 7.916e+02, threshold=8.038e+02, percent-clipped=0.0 2023-06-18 17:17:11,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=190410.0, ans=0.0 2023-06-18 17:17:25,271 INFO [train.py:996] (0/4) Epoch 2, batch 1250, loss[loss=0.3096, simple_loss=0.3754, pruned_loss=0.1219, over 21703.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3558, pruned_loss=0.1198, over 4291839.51 frames. ], batch size: 389, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:18:22,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=190590.0, ans=0.5 2023-06-18 17:19:41,850 INFO [train.py:996] (0/4) Epoch 2, batch 1300, loss[loss=0.2719, simple_loss=0.3325, pruned_loss=0.1057, over 21439.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.358, pruned_loss=0.1205, over 4293336.23 frames. 
], batch size: 211, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:20:13,432 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:20:46,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=190890.0, ans=0.2 2023-06-18 17:21:06,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=190950.0, ans=0.0 2023-06-18 17:21:38,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=191010.0, ans=0.0 2023-06-18 17:21:41,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.359e+02 4.196e+02 5.340e+02 8.417e+02, threshold=8.392e+02, percent-clipped=1.0 2023-06-18 17:21:58,779 INFO [train.py:996] (0/4) Epoch 2, batch 1350, loss[loss=0.2957, simple_loss=0.3415, pruned_loss=0.1249, over 21851.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3592, pruned_loss=0.1211, over 4295449.56 frames. ], batch size: 282, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 17:22:31,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=191130.0, ans=0.0 2023-06-18 17:22:31,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=191130.0, ans=0.125 2023-06-18 17:23:13,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.00 vs. limit=10.0 2023-06-18 17:24:00,734 INFO [train.py:996] (0/4) Epoch 2, batch 1400, loss[loss=0.2764, simple_loss=0.3362, pruned_loss=0.1083, over 15233.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3563, pruned_loss=0.121, over 4293911.88 frames. ], batch size: 60, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:24:23,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=191370.0, ans=0.0 2023-06-18 17:24:40,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=191430.0, ans=0.0 2023-06-18 17:24:52,377 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-18 17:24:56,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=191490.0, ans=0.125 2023-06-18 17:25:59,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=191610.0, ans=0.2 2023-06-18 17:26:00,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.213e+02 3.659e+02 4.697e+02 8.447e+02, threshold=7.317e+02, percent-clipped=1.0 2023-06-18 17:26:07,941 INFO [train.py:996] (0/4) Epoch 2, batch 1450, loss[loss=0.2681, simple_loss=0.3327, pruned_loss=0.1017, over 21838.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.356, pruned_loss=0.1216, over 4292533.19 frames. 
], batch size: 102, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:26:24,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=191670.0, ans=0.125 2023-06-18 17:27:08,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=191790.0, ans=0.125 2023-06-18 17:27:11,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191790.0, ans=0.1 2023-06-18 17:27:12,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=191790.0, ans=0.125 2023-06-18 17:27:14,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-18 17:27:17,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=191790.0, ans=0.2 2023-06-18 17:28:22,341 INFO [train.py:996] (0/4) Epoch 2, batch 1500, loss[loss=0.3312, simple_loss=0.3843, pruned_loss=0.139, over 21214.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3582, pruned_loss=0.1233, over 4294622.49 frames. ], batch size: 143, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:28:25,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=191970.0, ans=0.2 2023-06-18 17:28:28,232 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-32000.pt 2023-06-18 17:28:54,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=192030.0, ans=0.125 2023-06-18 17:29:57,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=192150.0, ans=0.035 2023-06-18 17:30:25,615 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 3.308e+02 4.250e+02 5.444e+02 1.048e+03, threshold=8.500e+02, percent-clipped=7.0 2023-06-18 17:30:27,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=192210.0, ans=0.125 2023-06-18 17:30:39,390 INFO [train.py:996] (0/4) Epoch 2, batch 1550, loss[loss=0.2344, simple_loss=0.285, pruned_loss=0.09187, over 21629.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3554, pruned_loss=0.1218, over 4294116.19 frames. ], batch size: 247, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 17:30:40,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=22.5 2023-06-18 17:30:51,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=192270.0, ans=0.125 2023-06-18 17:31:01,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-18 17:31:42,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=192390.0, ans=0.0 2023-06-18 17:32:53,938 INFO [train.py:996] (0/4) Epoch 2, batch 1600, loss[loss=0.2896, simple_loss=0.3551, pruned_loss=0.112, over 21721.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3529, pruned_loss=0.1207, over 4278376.05 frames. 
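], batch size: 351, lr: 1.96e-02, grad_scale: 32.0

The scaling.py:962 entries compare a whitening metric of a module's activations against a limit (for example 3.41 vs. limit=6.0 for the attention keys below); when the metric exceeds the limit, the Whiten modules add a penalty that pushes the feature covariance back toward isotropy. One plausible form of the metric is the eigenvalue-dispersion ratio sketched here, which is 1.0 for perfectly white features and at most num_channels for rank-1 ones; treat the exact formula as an assumption about scaling.py rather than a quote of it.

```python
import torch

# Sketch of a whitening metric: 1.0 when the per-group feature covariance is
# isotropic ("white"), growing toward num_channels as eigenvalues concentrate.
def whitening_metric(x, num_groups=1):
    n, c = x.shape                                    # (num_frames, num_channels)
    d = c // num_groups
    x = x.reshape(n, num_groups, d).transpose(0, 1)   # (groups, frames, d)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x.transpose(1, 2) @ x / n                   # per-group covariance
    trace_cov = cov.diagonal(dim1=1, dim2=2).sum(dim=1)
    trace_cov_sq = (cov ** 2).sum(dim=(1, 2))         # trace(cov @ cov)
    return (d * trace_cov_sq / trace_cov ** 2).mean()

x = torch.randn(1000, 256)   # near-white features
print(whitening_metric(x))   # ~1.3 (1 plus sampling noise), well under 15.0
```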
2023-06-18 17:33:10,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=192630.0, ans=0.125
2023-06-18 17:33:29,268 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0
2023-06-18 17:34:15,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0
2023-06-18 17:34:34,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0
2023-06-18 17:34:47,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 3.115e+02 3.710e+02 4.975e+02 8.119e+02, threshold=7.421e+02, percent-clipped=0.0
2023-06-18 17:34:54,920 INFO [train.py:996] (0/4) Epoch 2, batch 1650, loss[loss=0.2435, simple_loss=0.329, pruned_loss=0.079, over 21565.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3529, pruned_loss=0.1195, over 4281644.69 frames. ], batch size: 230, lr: 1.96e-02, grad_scale: 32.0
2023-06-18 17:34:56,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=192870.0, ans=0.125
2023-06-18 17:36:47,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=193050.0, ans=0.125
2023-06-18 17:37:10,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193110.0, ans=0.1
2023-06-18 17:37:12,716 INFO [train.py:996] (0/4) Epoch 2, batch 1700, loss[loss=0.323, simple_loss=0.3769, pruned_loss=0.1345, over 21309.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3567, pruned_loss=0.1212, over 4281702.60 frames. ], batch size: 159, lr: 1.96e-02, grad_scale: 32.0
2023-06-18 17:38:00,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=193230.0, ans=0.0
2023-06-18 17:38:03,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=193230.0, ans=0.125
2023-06-18 17:38:04,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=193290.0, ans=0.2
2023-06-18 17:38:38,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=193350.0, ans=0.125
2023-06-18 17:39:16,039 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 3.411e+02 4.217e+02 5.262e+02 9.170e+02, threshold=8.435e+02, percent-clipped=4.0
2023-06-18 17:39:18,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=193410.0, ans=0.2
2023-06-18 17:39:23,253 INFO [train.py:996] (0/4) Epoch 2, batch 1750, loss[loss=0.2187, simple_loss=0.3039, pruned_loss=0.06674, over 21716.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3575, pruned_loss=0.119, over 4286521.53 frames. ], batch size: 298, lr: 1.95e-02, grad_scale: 32.0
2023-06-18 17:39:51,018 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.10 vs.
limit=8.0 2023-06-18 17:41:45,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=193710.0, ans=0.125 2023-06-18 17:41:59,273 INFO [train.py:996] (0/4) Epoch 2, batch 1800, loss[loss=0.2455, simple_loss=0.3028, pruned_loss=0.09412, over 21182.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3542, pruned_loss=0.1157, over 4289207.20 frames. ], batch size: 159, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:42:55,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=193890.0, ans=0.1 2023-06-18 17:43:55,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-18 17:44:07,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=194010.0, ans=0.125 2023-06-18 17:44:08,482 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.686e+02 3.358e+02 3.965e+02 7.229e+02, threshold=6.717e+02, percent-clipped=0.0 2023-06-18 17:44:15,776 INFO [train.py:996] (0/4) Epoch 2, batch 1850, loss[loss=0.2809, simple_loss=0.3445, pruned_loss=0.1086, over 21658.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3524, pruned_loss=0.1116, over 4291878.64 frames. ], batch size: 263, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:45:43,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=194250.0, ans=0.125 2023-06-18 17:46:25,748 INFO [train.py:996] (0/4) Epoch 2, batch 1900, loss[loss=0.3413, simple_loss=0.413, pruned_loss=0.1348, over 21471.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3543, pruned_loss=0.1136, over 4294567.15 frames. ], batch size: 507, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:46:29,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=194370.0, ans=0.125 2023-06-18 17:46:32,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-18 17:46:44,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=22.5 2023-06-18 17:47:42,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=194550.0, ans=0.125 2023-06-18 17:48:06,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=194610.0, ans=0.0 2023-06-18 17:48:10,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=194610.0, ans=0.2 2023-06-18 17:48:18,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 3.031e+02 3.569e+02 4.348e+02 9.339e+02, threshold=7.139e+02, percent-clipped=5.0 2023-06-18 17:48:25,702 INFO [train.py:996] (0/4) Epoch 2, batch 1950, loss[loss=0.2547, simple_loss=0.3134, pruned_loss=0.09795, over 15430.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3524, pruned_loss=0.1148, over 4276442.60 frames. 
], batch size: 60, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:49:05,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=194730.0, ans=0.125 2023-06-18 17:49:28,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=22.5 2023-06-18 17:49:36,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=194790.0, ans=10.0 2023-06-18 17:49:37,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=194790.0, ans=0.125 2023-06-18 17:50:05,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=194910.0, ans=0.125 2023-06-18 17:50:31,166 INFO [train.py:996] (0/4) Epoch 2, batch 2000, loss[loss=0.2268, simple_loss=0.2899, pruned_loss=0.08185, over 21610.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3492, pruned_loss=0.1141, over 4273137.70 frames. ], batch size: 247, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:50:53,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=194970.0, ans=0.125 2023-06-18 17:50:53,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=194970.0, ans=0.1 2023-06-18 17:50:55,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.37 vs. limit=15.0 2023-06-18 17:51:08,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=195030.0, ans=0.125 2023-06-18 17:51:10,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=195030.0, ans=0.125 2023-06-18 17:51:13,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=195030.0, ans=0.125 2023-06-18 17:51:16,755 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.97 vs. limit=6.0 2023-06-18 17:51:47,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=195090.0, ans=0.07 2023-06-18 17:52:14,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=195210.0, ans=0.125 2023-06-18 17:52:30,343 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.996e+02 3.369e+02 4.467e+02 7.434e+02, threshold=6.738e+02, percent-clipped=1.0 2023-06-18 17:52:37,738 INFO [train.py:996] (0/4) Epoch 2, batch 2050, loss[loss=0.3083, simple_loss=0.377, pruned_loss=0.1197, over 21776.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3477, pruned_loss=0.112, over 4274213.43 frames. 
], batch size: 298, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 17:52:53,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=195270.0, ans=0.125 2023-06-18 17:53:07,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=195330.0, ans=0.0 2023-06-18 17:53:15,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=195330.0, ans=0.125 2023-06-18 17:53:17,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=195330.0, ans=0.0 2023-06-18 17:53:17,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=195330.0, ans=0.125 2023-06-18 17:53:31,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=195330.0, ans=0.0 2023-06-18 17:53:34,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=195390.0, ans=0.0 2023-06-18 17:53:53,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=195390.0, ans=0.025 2023-06-18 17:53:58,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-18 17:54:35,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=195510.0, ans=0.0 2023-06-18 17:54:52,689 INFO [train.py:996] (0/4) Epoch 2, batch 2100, loss[loss=0.3123, simple_loss=0.3784, pruned_loss=0.1231, over 21579.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3515, pruned_loss=0.115, over 4283143.00 frames. ], batch size: 230, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 17:54:57,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=195570.0, ans=0.0 2023-06-18 17:55:03,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=195570.0, ans=0.125 2023-06-18 17:56:47,252 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.911e+02 3.457e+02 4.182e+02 7.593e+02, threshold=6.915e+02, percent-clipped=4.0 2023-06-18 17:56:54,819 INFO [train.py:996] (0/4) Epoch 2, batch 2150, loss[loss=0.28, simple_loss=0.328, pruned_loss=0.116, over 21768.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3552, pruned_loss=0.1175, over 4271481.64 frames. ], batch size: 351, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 17:57:19,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195870.0, ans=0.1 2023-06-18 17:57:22,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=195870.0, ans=0.125 2023-06-18 17:57:49,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.79 vs. 
limit=15.0 2023-06-18 17:57:55,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195930.0, ans=0.1 2023-06-18 17:58:33,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=196050.0, ans=0.1 2023-06-18 17:58:40,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.44 vs. limit=15.0 2023-06-18 17:59:17,874 INFO [train.py:996] (0/4) Epoch 2, batch 2200, loss[loss=0.3275, simple_loss=0.3516, pruned_loss=0.1517, over 21434.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.356, pruned_loss=0.1182, over 4279238.05 frames. ], batch size: 510, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 18:00:07,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=196230.0, ans=0.125 2023-06-18 18:01:21,551 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.132e+02 3.798e+02 4.923e+02 7.724e+02, threshold=7.595e+02, percent-clipped=3.0 2023-06-18 18:01:26,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=196410.0, ans=0.0 2023-06-18 18:01:28,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=196470.0, ans=0.125 2023-06-18 18:01:29,001 INFO [train.py:996] (0/4) Epoch 2, batch 2250, loss[loss=0.2697, simple_loss=0.3244, pruned_loss=0.1075, over 21680.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3522, pruned_loss=0.1153, over 4279800.71 frames. ], batch size: 263, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 18:01:30,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=196470.0, ans=0.0 2023-06-18 18:01:45,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-18 18:02:49,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=196650.0, ans=0.2 2023-06-18 18:03:15,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=196710.0, ans=0.1 2023-06-18 18:03:24,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.31 vs. limit=10.0 2023-06-18 18:03:27,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=196710.0, ans=0.125 2023-06-18 18:03:31,111 INFO [train.py:996] (0/4) Epoch 2, batch 2300, loss[loss=0.2593, simple_loss=0.3023, pruned_loss=0.1082, over 21429.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.347, pruned_loss=0.1142, over 4282701.49 frames. 
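], batch size: 211, lr: 1.94e-02, grad_scale: 32.0

The grad_scale field in the train.py:996 entries is the current loss-scale of mixed-precision training; earlier in this log it grows to 64.0 and is knocked back to 32.0 and then 16.0, the usual behaviour of a scaler that halves on fp16 overflow and grows again after a run of stable steps. A minimal sketch of that loop with torch.cuda.amp; model, optimizer and batch are placeholders, and icefall's actual training step also handles loss bookkeeping omitted here.

```python
import torch

scaler = torch.cuda.amp.GradScaler(enabled=True)  # get_scale() is the logged grad_scale

def train_step(model, optimizer, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=True):  # fp16 forward pass
        loss = model(batch)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # skips the update if grads hit inf/nan
    scaler.update()                 # halves the scale on overflow, slowly
                                    # grows it after consecutive good steps
```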
2023-06-18 18:04:48,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=196950.0, ans=0.2
2023-06-18 18:04:55,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=196950.0, ans=0.0
2023-06-18 18:05:12,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=15.0
2023-06-18 18:05:16,920 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.103e+02 3.931e+02 4.901e+02 7.209e+02, threshold=7.862e+02, percent-clipped=0.0
2023-06-18 18:05:29,587 INFO [train.py:996] (0/4) Epoch 2, batch 2350, loss[loss=0.3023, simple_loss=0.3413, pruned_loss=0.1317, over 21562.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.3439, pruned_loss=0.115, over 4267855.14 frames. ], batch size: 548, lr: 1.94e-02, grad_scale: 32.0
2023-06-18 18:05:31,501 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 18:06:09,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=197130.0, ans=0.0
2023-06-18 18:07:43,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=197310.0, ans=0.0
2023-06-18 18:07:48,757 INFO [train.py:996] (0/4) Epoch 2, batch 2400, loss[loss=0.3078, simple_loss=0.3645, pruned_loss=0.1255, over 21375.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3489, pruned_loss=0.1192, over 4274509.85 frames. ], batch size: 143, lr: 1.94e-02, grad_scale: 32.0
2023-06-18 18:08:19,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=197430.0, ans=10.0
2023-06-18 18:08:36,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197490.0, ans=0.1
2023-06-18 18:08:36,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.29 vs. limit=10.0
2023-06-18 18:09:28,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=197550.0, ans=0.125
2023-06-18 18:09:34,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=197610.0, ans=0.125
2023-06-18 18:09:47,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.219e+02 3.807e+02 5.185e+02 8.292e+02, threshold=7.615e+02, percent-clipped=0.0
2023-06-18 18:10:11,964 INFO [train.py:996] (0/4) Epoch 2, batch 2450, loss[loss=0.3234, simple_loss=0.3676, pruned_loss=0.1396, over 21536.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3559, pruned_loss=0.122, over 4278486.90 frames.
2023-06-18 18:10:19,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=197670.0, ans=0.125
2023-06-18 18:10:50,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=197730.0, ans=0.0
2023-06-18 18:12:02,916 INFO [train.py:996] (0/4) Epoch 2, batch 2500, loss[loss=0.2812, simple_loss=0.3675, pruned_loss=0.09744, over 21410.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3529, pruned_loss=0.1219, over 4282547.95 frames. ], batch size: 211, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 18:12:33,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0
2023-06-18 18:12:49,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0
2023-06-18 18:13:33,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0
2023-06-18 18:14:04,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=198210.0, ans=0.125
2023-06-18 18:14:05,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 3.063e+02 3.765e+02 4.527e+02 8.351e+02, threshold=7.530e+02, percent-clipped=1.0
2023-06-18 18:14:20,545 INFO [train.py:996] (0/4) Epoch 2, batch 2550, loss[loss=0.2789, simple_loss=0.3645, pruned_loss=0.09664, over 21559.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3506, pruned_loss=0.1199, over 4268823.40 frames. ], batch size: 230, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 18:14:49,259 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 18:15:36,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=198450.0, ans=0.125
2023-06-18 18:16:35,998 INFO [train.py:996] (0/4) Epoch 2, batch 2600, loss[loss=0.3721, simple_loss=0.3971, pruned_loss=0.1736, over 21337.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3528, pruned_loss=0.1215, over 4264235.84 frames. ], batch size: 471, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 18:18:35,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.099e+02 3.578e+02 4.381e+02 8.745e+02, threshold=7.157e+02, percent-clipped=2.0
2023-06-18 18:18:47,389 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5
2023-06-18 18:18:47,910 INFO [train.py:996] (0/4) Epoch 2, batch 2650, loss[loss=0.2789, simple_loss=0.3264, pruned_loss=0.1157, over 21379.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.355, pruned_loss=0.1233, over 4271568.04 frames. ], batch size: 176, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 18:19:12,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.95 vs. limit=22.5
2023-06-18 18:19:22,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=198930.0, ans=0.0
2023-06-18 18:19:56,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=22.5
2023-06-18 18:20:56,650 INFO [train.py:996] (0/4) Epoch 2, batch 2700, loss[loss=0.2406, simple_loss=0.2953, pruned_loss=0.09294, over 21623.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.3544, pruned_loss=0.1216, over 4270504.83 frames. ], batch size: 230, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 18:20:57,010 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 18:20:57,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=199170.0, ans=0.0
2023-06-18 18:21:21,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=199230.0, ans=0.125
2023-06-18 18:21:40,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=199290.0, ans=0.0
2023-06-18 18:21:40,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.18 vs. limit=15.0
2023-06-18 18:21:59,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.93 vs. limit=15.0
2023-06-18 18:22:20,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0
2023-06-18 18:22:22,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=199410.0, ans=0.0
2023-06-18 18:22:39,833 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.312e+02 2.957e+02 3.708e+02 4.749e+02 8.354e+02, threshold=7.415e+02, percent-clipped=2.0
2023-06-18 18:22:52,588 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=15.0
2023-06-18 18:22:53,015 INFO [train.py:996] (0/4) Epoch 2, batch 2750, loss[loss=0.3019, simple_loss=0.4089, pruned_loss=0.09739, over 19710.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3511, pruned_loss=0.1207, over 4271869.20 frames. ], batch size: 702, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 18:24:56,955 INFO [train.py:996] (0/4) Epoch 2, batch 2800, loss[loss=0.355, simple_loss=0.4136, pruned_loss=0.1482, over 21797.00 frames. ], tot_loss[loss=0.3006, simple_loss=0.3569, pruned_loss=0.1221, over 4275388.88 frames. ], batch size: 332, lr: 1.92e-02, grad_scale: 32.0
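The ScheduledFloat entries (e.g. the ff2_skip_rate and balancer prob values just above) are training-schedule knobs whose value moves with batch_count. A stand-in that reproduces the behavior, assuming piecewise-linear interpolation between (batch, value) breakpoints as in scaling.py; the breakpoints in the example are made up:

def scheduled_float(batch_count, points):
    # points: [(batch, value), ...] sorted by batch; linear in between,
    # clamped at the ends -- mirroring how scaling.py schedules these knobs.
    if batch_count <= points[0][0]:
        return points[0][1]
    for (b0, v0), (b1, v1) in zip(points, points[1:]):
        if batch_count <= b1:
            return v0 + (batch_count - b0) / (b1 - b0) * (v1 - v0)
    return points[-1][1]

# Hypothetical breakpoints: a dropout that decayed from 0.3 to its final 0.1
# long before batch_count ~199k, matching the "dropout_p ... ans=0.1" lines.
print(scheduled_float(199290.0, [(0.0, 0.3), (20000.0, 0.1)]))  # -> 0.1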
2023-06-18 18:25:25,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=199770.0, ans=0.125
2023-06-18 18:25:26,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=199770.0, ans=0.0
2023-06-18 18:27:01,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=200010.0, ans=0.0
2023-06-18 18:27:04,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=200010.0, ans=0.125
2023-06-18 18:27:05,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.554e+02 3.576e+02 4.220e+02 5.445e+02 9.845e+02, threshold=8.440e+02, percent-clipped=5.0
2023-06-18 18:27:12,998 INFO [train.py:996] (0/4) Epoch 2, batch 2850, loss[loss=0.2556, simple_loss=0.3191, pruned_loss=0.09608, over 21710.00 frames. ], tot_loss[loss=0.3005, simple_loss=0.356, pruned_loss=0.1224, over 4274310.03 frames. ], batch size: 282, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 18:27:19,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=200070.0, ans=0.015
2023-06-18 18:27:56,292 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=22.5
2023-06-18 18:27:59,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=200190.0, ans=0.125
2023-06-18 18:29:23,632 INFO [train.py:996] (0/4) Epoch 2, batch 2900, loss[loss=0.3391, simple_loss=0.3772, pruned_loss=0.1504, over 21545.00 frames. ], tot_loss[loss=0.296, simple_loss=0.35, pruned_loss=0.121, over 4274663.83 frames. ], batch size: 471, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 18:29:24,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=200370.0, ans=0.125
2023-06-18 18:30:23,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=200490.0, ans=0.125
2023-06-18 18:30:44,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=200550.0, ans=0.0
2023-06-18 18:30:59,282 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0
2023-06-18 18:31:26,824 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.217e+02 3.802e+02 4.716e+02 7.016e+02, threshold=7.604e+02, percent-clipped=0.0
2023-06-18 18:31:31,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=200610.0, ans=0.0
2023-06-18 18:31:34,049 INFO [train.py:996] (0/4) Epoch 2, batch 2950, loss[loss=0.2746, simple_loss=0.358, pruned_loss=0.09558, over 21819.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3515, pruned_loss=0.121, over 4277104.30 frames. ], batch size: 332, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 18:31:54,912 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0
2023-06-18 18:33:03,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.46 vs. limit=15.0
2023-06-18 18:33:11,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=200850.0, ans=0.125
2023-06-18 18:33:14,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.22 vs. limit=10.0
2023-06-18 18:33:43,233 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=15.0
2023-06-18 18:34:01,563 INFO [train.py:996] (0/4) Epoch 2, batch 3000, loss[loss=0.2714, simple_loss=0.3526, pruned_loss=0.09515, over 21682.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3545, pruned_loss=0.1211, over 4279557.44 frames. ], batch size: 230, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 18:34:01,564 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-18 18:34:49,992 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.277, simple_loss=0.3697, pruned_loss=0.09215, over 1796401.00 frames.
2023-06-18 18:34:49,992 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB
2023-06-18 18:35:08,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=200970.0, ans=0.0
2023-06-18 18:35:32,946 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 18:36:09,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=201150.0, ans=0.2
2023-06-18 18:36:50,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.746e+02 3.199e+02 3.819e+02 7.191e+02, threshold=6.398e+02, percent-clipped=0.0
2023-06-18 18:37:02,544 INFO [train.py:996] (0/4) Epoch 2, batch 3050, loss[loss=0.2946, simple_loss=0.356, pruned_loss=0.1166, over 21737.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3556, pruned_loss=0.1191, over 4281324.18 frames. ], batch size: 414, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 18:37:53,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=201390.0, ans=0.1
2023-06-18 18:38:04,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=201390.0, ans=0.125
2023-06-18 18:38:22,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=201450.0, ans=0.125
2023-06-18 18:38:59,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=201510.0, ans=0.125
2023-06-18 18:39:10,278 INFO [train.py:996] (0/4) Epoch 2, batch 3100, loss[loss=0.2202, simple_loss=0.2854, pruned_loss=0.0775, over 21285.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.355, pruned_loss=0.1174, over 4285326.56 frames. ], batch size: 159, lr: 1.92e-02, grad_scale: 32.0
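At batch 3000 above the run pauses to compute validation loss and reports peak GPU memory. A sketch of that pattern (model, valid_dl and compute_loss are stand-ins, and the frame-weighted averaging is an assumption; the memory call is the standard torch API):

import torch

def validate(model, valid_dl, compute_loss, device="cuda:0"):
    model.eval()
    tot, frames = 0.0, 0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)
            tot += loss.item() * num_frames
            frames += num_frames
    model.train()
    mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot / frames:.4g}")
    print(f"Maximum memory allocated so far is {mem_mb}MB")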
2023-06-18 18:39:15,066 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 18:39:32,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=201630.0, ans=0.125
2023-06-18 18:39:51,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=201630.0, ans=0.125
2023-06-18 18:40:14,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=22.5
2023-06-18 18:40:25,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=201750.0, ans=0.0
2023-06-18 18:41:08,929 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.803e+02 3.511e+02 4.276e+02 6.392e+02, threshold=7.021e+02, percent-clipped=0.0
2023-06-18 18:41:28,769 INFO [train.py:996] (0/4) Epoch 2, batch 3150, loss[loss=0.3038, simple_loss=0.3522, pruned_loss=0.1276, over 20946.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3569, pruned_loss=0.1175, over 4289505.44 frames. ], batch size: 608, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 18:42:36,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=201990.0, ans=0.05
2023-06-18 18:42:37,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=201990.0, ans=0.125
2023-06-18 18:42:37,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=201990.0, ans=0.125
2023-06-18 18:43:04,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=202050.0, ans=0.125
2023-06-18 18:43:16,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=202110.0, ans=0.125
2023-06-18 18:43:38,667 INFO [train.py:996] (0/4) Epoch 2, batch 3200, loss[loss=0.311, simple_loss=0.3658, pruned_loss=0.1281, over 21838.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3602, pruned_loss=0.12, over 4285416.87 frames. ], batch size: 118, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 18:44:01,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=202170.0, ans=0.04949747468305833
2023-06-18 18:44:37,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0
2023-06-18 18:45:50,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 3.485e+02 4.090e+02 5.140e+02 8.255e+02, threshold=8.180e+02, percent-clipped=4.0
2023-06-18 18:45:50,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=202410.0, ans=0.2
2023-06-18 18:46:02,310 INFO [train.py:996] (0/4) Epoch 2, batch 3250, loss[loss=0.2932, simple_loss=0.3347, pruned_loss=0.1259, over 21685.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3631, pruned_loss=0.1219, over 4288512.48 frames. ], batch size: 282, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 18:46:07,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=11.31 vs. limit=15.0
2023-06-18 18:46:25,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=202530.0, ans=0.0
2023-06-18 18:48:07,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=202710.0, ans=0.125
2023-06-18 18:48:10,953 INFO [train.py:996] (0/4) Epoch 2, batch 3300, loss[loss=0.2454, simple_loss=0.3026, pruned_loss=0.09413, over 21526.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.3596, pruned_loss=0.1219, over 4273351.41 frames. ], batch size: 263, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 18:50:26,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.159e+02 3.765e+02 4.410e+02 7.992e+02, threshold=7.530e+02, percent-clipped=0.0
2023-06-18 18:50:32,590 INFO [train.py:996] (0/4) Epoch 2, batch 3350, loss[loss=0.383, simple_loss=0.4352, pruned_loss=0.1654, over 21463.00 frames. ], tot_loss[loss=0.3047, simple_loss=0.3645, pruned_loss=0.1224, over 4270649.63 frames. ], batch size: 471, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 18:50:34,473 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 18:50:43,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=203070.0, ans=0.0
2023-06-18 18:50:44,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=203070.0, ans=0.125
2023-06-18 18:51:36,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=203190.0, ans=0.125
2023-06-18 18:52:47,714 INFO [train.py:996] (0/4) Epoch 2, batch 3400, loss[loss=0.283, simple_loss=0.3436, pruned_loss=0.1112, over 21870.00 frames. ], tot_loss[loss=0.3059, simple_loss=0.3644, pruned_loss=0.1237, over 4273767.68 frames. ], batch size: 118, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 18:53:23,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=203430.0, ans=0.125
2023-06-18 18:53:58,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=203490.0, ans=0.0
2023-06-18 18:54:26,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=203550.0, ans=0.2
2023-06-18 18:54:52,079 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.921e+02 3.464e+02 4.240e+02 8.326e+02, threshold=6.928e+02, percent-clipped=2.0
2023-06-18 18:54:58,314 INFO [train.py:996] (0/4) Epoch 2, batch 3450, loss[loss=0.3725, simple_loss=0.3867, pruned_loss=0.1791, over 21377.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3601, pruned_loss=0.1233, over 4273527.59 frames. ], batch size: 507, lr: 1.91e-02, grad_scale: 32.0
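On the Whitening lines (e.g. metric=11.31 vs. limit=15.0 above): the metric measures how far a module's activations are from having a white (isotropic) covariance, and the penalty in scaling.py only engages once the metric exceeds the limit. As I read it, the metric is the ratio of the mean squared eigenvalue of the channel covariance to the squared mean eigenvalue, which is 1.0 for perfectly white features; treat the exact formula, and the grouping details, as assumptions:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations for one group. Returns
    # mean(eig^2) / mean(eig)^2 of the covariance: 1.0 iff all eigenvalues
    # are equal, i.e. the features are white.
    x = x - x.mean(dim=0)
    cov = (x.t() @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    return ((eigs ** 2).mean() / eigs.mean() ** 2).item()

white = torch.randn(1000, 256)
print(whitening_metric(white))                    # ~1.0: far below limit=15.0
print(whitening_metric(white * torch.rand(256)))  # anisotropic: noticeably larger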
2023-06-18 18:55:17,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=203670.0, ans=0.125
2023-06-18 18:55:26,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=203730.0, ans=0.125
2023-06-18 18:55:27,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.68 vs. limit=15.0
2023-06-18 18:55:55,389 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0
2023-06-18 18:56:42,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=203850.0, ans=0.0
2023-06-18 18:56:51,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=203850.0, ans=0.0
2023-06-18 18:57:25,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=203910.0, ans=0.1
2023-06-18 18:57:28,385 INFO [train.py:996] (0/4) Epoch 2, batch 3500, loss[loss=0.2631, simple_loss=0.3715, pruned_loss=0.07735, over 20747.00 frames. ], tot_loss[loss=0.3092, simple_loss=0.3659, pruned_loss=0.1263, over 4265806.69 frames. ], batch size: 608, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 18:58:23,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=204090.0, ans=0.07
2023-06-18 18:58:55,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204150.0, ans=0.1
2023-06-18 18:59:25,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 3.105e+02 3.610e+02 4.528e+02 8.050e+02, threshold=7.220e+02, percent-clipped=2.0
2023-06-18 18:59:46,323 INFO [train.py:996] (0/4) Epoch 2, batch 3550, loss[loss=0.3028, simple_loss=0.3472, pruned_loss=0.1292, over 21773.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3694, pruned_loss=0.1286, over 4266301.67 frames. ], batch size: 351, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 19:01:43,299 INFO [train.py:996] (0/4) Epoch 2, batch 3600, loss[loss=0.2953, simple_loss=0.3415, pruned_loss=0.1246, over 21731.00 frames. ], tot_loss[loss=0.307, simple_loss=0.3615, pruned_loss=0.1263, over 4262593.10 frames. ], batch size: 247, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 19:02:10,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=204570.0, ans=0.125
2023-06-18 19:03:04,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=204690.0, ans=0.125
2023-06-18 19:03:58,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.966e+02 3.675e+02 4.582e+02 9.024e+02, threshold=7.350e+02, percent-clipped=3.0
2023-06-18 19:04:11,489 INFO [train.py:996] (0/4) Epoch 2, batch 3650, loss[loss=0.2357, simple_loss=0.3109, pruned_loss=0.08025, over 21749.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3629, pruned_loss=0.1271, over 4261478.45 frames. ], batch size: 247, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 19:04:52,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=204930.0, ans=0.125
2023-06-18 19:05:25,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=204990.0, ans=0.125
2023-06-18 19:06:10,720 INFO [train.py:996] (0/4) Epoch 2, batch 3700, loss[loss=0.2748, simple_loss=0.3316, pruned_loss=0.109, over 21732.00 frames. ], tot_loss[loss=0.3047, simple_loss=0.3591, pruned_loss=0.1252, over 4271414.38 frames. ], batch size: 230, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 19:06:50,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=205230.0, ans=0.0
2023-06-18 19:06:53,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=205230.0, ans=0.125
2023-06-18 19:08:06,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=205410.0, ans=0.0
2023-06-18 19:08:06,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0
2023-06-18 19:08:22,243 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.719e+02 3.274e+02 3.792e+02 6.740e+02, threshold=6.549e+02, percent-clipped=0.0
2023-06-18 19:08:33,924 INFO [train.py:996] (0/4) Epoch 2, batch 3750, loss[loss=0.2524, simple_loss=0.3196, pruned_loss=0.09254, over 21859.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3555, pruned_loss=0.1236, over 4281668.91 frames. ], batch size: 333, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 19:08:43,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205470.0, ans=0.1
2023-06-18 19:09:01,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=205530.0, ans=0.5
2023-06-18 19:10:11,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0
2023-06-18 19:10:25,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=205710.0, ans=0.125
2023-06-18 19:11:06,933 INFO [train.py:996] (0/4) Epoch 2, batch 3800, loss[loss=0.2686, simple_loss=0.3189, pruned_loss=0.1091, over 21659.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3539, pruned_loss=0.1216, over 4282735.61 frames. ], batch size: 112, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 19:11:33,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=205830.0, ans=0.125
2023-06-18 19:12:04,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=205950.0, ans=0.125
2023-06-18 19:12:43,233 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=15.0
2023-06-18 19:12:47,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=206010.0, ans=0.0
2023-06-18 19:12:49,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 3.000e+02 3.869e+02 4.812e+02 8.338e+02, threshold=7.738e+02, percent-clipped=10.0
2023-06-18 19:12:59,568 INFO [train.py:996] (0/4) Epoch 2, batch 3850, loss[loss=0.2461, simple_loss=0.2967, pruned_loss=0.09777, over 21335.00 frames. ], tot_loss[loss=0.2991, simple_loss=0.3526, pruned_loss=0.1228, over 4268407.77 frames. ], batch size: 211, lr: 1.90e-02, grad_scale: 16.0
2023-06-18 19:13:18,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=15.0
2023-06-18 19:14:19,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=206250.0, ans=0.125
2023-06-18 19:14:52,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=206310.0, ans=0.0
2023-06-18 19:14:54,899 INFO [train.py:996] (0/4) Epoch 2, batch 3900, loss[loss=0.2643, simple_loss=0.32, pruned_loss=0.1042, over 21534.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3488, pruned_loss=0.1224, over 4259674.33 frames. ], batch size: 195, lr: 1.89e-02, grad_scale: 16.0
2023-06-18 19:15:08,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=206370.0, ans=0.0
2023-06-18 19:15:22,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. limit=10.0
2023-06-18 19:17:07,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.842e+02 3.227e+02 4.001e+02 6.107e+02, threshold=6.453e+02, percent-clipped=0.0
2023-06-18 19:17:23,796 INFO [train.py:996] (0/4) Epoch 2, batch 3950, loss[loss=0.3103, simple_loss=0.3872, pruned_loss=0.1167, over 21631.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3529, pruned_loss=0.1214, over 4265250.41 frames. ], batch size: 389, lr: 1.89e-02, grad_scale: 16.0
2023-06-18 19:17:25,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=206670.0, ans=0.125
2023-06-18 19:18:16,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=206790.0, ans=0.125
2023-06-18 19:18:48,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=206850.0, ans=0.0
2023-06-18 19:19:07,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=206910.0, ans=0.125
2023-06-18 19:19:28,695 INFO [train.py:996] (0/4) Epoch 2, batch 4000, loss[loss=0.2757, simple_loss=0.3277, pruned_loss=0.1118, over 21862.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3455, pruned_loss=0.1179, over 4266067.75 frames. ], batch size: 98, lr: 1.89e-02, grad_scale: 32.0
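The grad_scale column is the mixed-precision loss scale (use_fp16=True for this run): it halves when an inf/nan gradient is detected, which is why it drops from 32.0 to 16.0 at batch 3850 above, and it recovers once steps are clean again (back to 32.0 by batch 4000). The standard torch.cuda.amp pattern below shows where the value comes from; the training script's exact grow/shrink cadence may differ:

import torch

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(init_scale=32.0)  # init value chosen to match the log

for _ in range(3):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 10, device="cuda")).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # skipped (and the scale halved) if inf/nan gradients appear
    scaler.update()         # grows the scale again after a run of clean steps
    print("grad_scale:", scaler.get_scale())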
2023-06-18 19:19:36,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=206970.0, ans=0.125
2023-06-18 19:19:42,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=206970.0, ans=0.125
2023-06-18 19:20:06,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.02 vs. limit=15.0
2023-06-18 19:20:13,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=207090.0, ans=0.2
2023-06-18 19:20:55,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=207150.0, ans=0.0
2023-06-18 19:20:59,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=207150.0, ans=0.125
2023-06-18 19:21:38,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.725e+02 3.231e+02 3.855e+02 7.242e+02, threshold=6.463e+02, percent-clipped=2.0
2023-06-18 19:21:42,611 INFO [train.py:996] (0/4) Epoch 2, batch 4050, loss[loss=0.2796, simple_loss=0.3458, pruned_loss=0.1067, over 21797.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3442, pruned_loss=0.1152, over 4275344.77 frames. ], batch size: 332, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 19:23:53,143 INFO [train.py:996] (0/4) Epoch 2, batch 4100, loss[loss=0.2576, simple_loss=0.3298, pruned_loss=0.09275, over 21563.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3456, pruned_loss=0.1156, over 4281010.81 frames. ], batch size: 195, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 19:25:31,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=207750.0, ans=10.0
2023-06-18 19:25:49,502 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.985e+02 3.648e+02 4.743e+02 7.155e+02, threshold=7.295e+02, percent-clipped=4.0
2023-06-18 19:26:12,885 INFO [train.py:996] (0/4) Epoch 2, batch 4150, loss[loss=0.2346, simple_loss=0.3221, pruned_loss=0.07359, over 21343.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3459, pruned_loss=0.111, over 4279423.29 frames. ], batch size: 194, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 19:26:29,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=207930.0, ans=10.0
2023-06-18 19:27:23,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0
2023-06-18 19:27:49,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=208110.0, ans=0.0
2023-06-18 19:27:52,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=208110.0, ans=0.125
2023-06-18 19:28:01,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=208110.0, ans=0.1
2023-06-18 19:28:14,150 INFO [train.py:996] (0/4) Epoch 2, batch 4200, loss[loss=0.268, simple_loss=0.3236, pruned_loss=0.1062, over 21359.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3445, pruned_loss=0.111, over 4276532.11 frames. ], batch size: 143, lr: 1.89e-02, grad_scale: 32.0
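The lr column decays very slowly across this stretch (1.94e-02 down to 1.89e-02), consistent with icefall's Eden schedule given base_lr=0.045, lr_batches=7500, lr_epochs=1.5. The formula below is my understanding of optim.py's Eden; the logged values run somewhat higher than this back-of-the-envelope, so the script's exact batch/epoch accounting evidently differs in detail:

def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.5):
    # Eden: lr shrinks smoothly with both the batch index and the epoch,
    # each through a (1 + (t/T)^2)^-0.25 style factor.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

print(eden_lr(0.045, batch=36000, epoch=2))  # ~1.6e-02, the same order as the log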
2023-06-18 19:29:17,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=208290.0, ans=0.125
2023-06-18 19:29:19,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=208290.0, ans=0.0
2023-06-18 19:29:40,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=208350.0, ans=0.95
2023-06-18 19:30:28,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 3.201e+02 3.847e+02 4.698e+02 6.915e+02, threshold=7.694e+02, percent-clipped=0.0
2023-06-18 19:30:32,577 INFO [train.py:996] (0/4) Epoch 2, batch 4250, loss[loss=0.3504, simple_loss=0.4096, pruned_loss=0.1456, over 21587.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3521, pruned_loss=0.1139, over 4278406.68 frames. ], batch size: 414, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 19:30:33,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=208470.0, ans=0.1
2023-06-18 19:31:00,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5
2023-06-18 19:31:11,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=208530.0, ans=0.0
2023-06-18 19:32:09,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=208650.0, ans=0.2
2023-06-18 19:32:21,403 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 19:32:34,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=208710.0, ans=0.2
2023-06-18 19:32:35,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=208710.0, ans=0.125
2023-06-18 19:32:47,718 INFO [train.py:996] (0/4) Epoch 2, batch 4300, loss[loss=0.2501, simple_loss=0.327, pruned_loss=0.08658, over 21606.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3589, pruned_loss=0.1173, over 4273995.28 frames. ], batch size: 230, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 19:34:10,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=208890.0, ans=0.125
2023-06-18 19:34:40,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=208950.0, ans=0.0
2023-06-18 19:34:44,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=209010.0, ans=0.125
2023-06-18 19:35:14,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 3.172e+02 3.778e+02 4.372e+02 7.557e+02, threshold=7.556e+02, percent-clipped=0.0
2023-06-18 19:35:25,695 INFO [train.py:996] (0/4) Epoch 2, batch 4350, loss[loss=0.2617, simple_loss=0.3034, pruned_loss=0.11, over 21187.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3558, pruned_loss=0.1162, over 4265610.28 frames. ], batch size: 159, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 19:35:30,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0
2023-06-18 19:35:53,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.80 vs. limit=22.5
2023-06-18 19:36:43,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=209250.0, ans=0.125
2023-06-18 19:37:19,260 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0
2023-06-18 19:37:38,251 INFO [train.py:996] (0/4) Epoch 2, batch 4400, loss[loss=0.2509, simple_loss=0.3015, pruned_loss=0.1001, over 21510.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3513, pruned_loss=0.1156, over 4262153.23 frames. ], batch size: 230, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 19:37:43,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=209370.0, ans=0.04949747468305833
2023-06-18 19:38:45,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=209490.0, ans=0.0
2023-06-18 19:39:25,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=209550.0, ans=0.07
2023-06-18 19:39:41,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=15.0
2023-06-18 19:39:47,529 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.882e+02 3.350e+02 4.158e+02 7.095e+02, threshold=6.699e+02, percent-clipped=0.0
2023-06-18 19:39:57,974 INFO [train.py:996] (0/4) Epoch 2, batch 4450, loss[loss=0.2705, simple_loss=0.3317, pruned_loss=0.1046, over 21926.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.3615, pruned_loss=0.118, over 4265602.74 frames. ], batch size: 107, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 19:40:17,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=209670.0, ans=0.125
2023-06-18 19:40:42,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=209730.0, ans=0.125
2023-06-18 19:40:43,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=209730.0, ans=0.125
2023-06-18 19:41:15,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=209850.0, ans=0.125
2023-06-18 19:41:26,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=209850.0, ans=0.125
2023-06-18 19:42:15,035 INFO [train.py:996] (0/4) Epoch 2, batch 4500, loss[loss=0.2715, simple_loss=0.3393, pruned_loss=0.1018, over 21816.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3626, pruned_loss=0.1196, over 4274535.43 frames. ], batch size: 282, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 19:42:28,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=209970.0, ans=0.07
2023-06-18 19:42:39,349 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 19:43:39,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=210150.0, ans=0.125
2023-06-18 19:44:15,229 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 19:44:16,216 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.865e+02 3.524e+02 4.358e+02 7.119e+02, threshold=7.048e+02, percent-clipped=2.0
2023-06-18 19:44:38,598 INFO [train.py:996] (0/4) Epoch 2, batch 4550, loss[loss=0.3731, simple_loss=0.4183, pruned_loss=0.1639, over 21565.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.365, pruned_loss=0.1196, over 4274628.86 frames. ], batch size: 414, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 19:45:09,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.44 vs. limit=15.0
2023-06-18 19:45:30,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=210390.0, ans=10.0
2023-06-18 19:45:30,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=21.11 vs. limit=15.0
2023-06-18 19:46:49,060 INFO [train.py:996] (0/4) Epoch 2, batch 4600, loss[loss=0.2822, simple_loss=0.345, pruned_loss=0.1097, over 21745.00 frames. ], tot_loss[loss=0.3078, simple_loss=0.3698, pruned_loss=0.1229, over 4281655.09 frames. ], batch size: 112, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 19:46:52,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=210570.0, ans=0.0
2023-06-18 19:46:52,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=210570.0, ans=0.2
2023-06-18 19:48:19,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=210750.0, ans=0.0
2023-06-18 19:48:53,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=210810.0, ans=0.0
2023-06-18 19:48:59,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.115e+02 3.942e+02 4.901e+02 8.215e+02, threshold=7.883e+02, percent-clipped=5.0
2023-06-18 19:49:07,879 INFO [train.py:996] (0/4) Epoch 2, batch 4650, loss[loss=0.2674, simple_loss=0.3194, pruned_loss=0.1077, over 21662.00 frames. ], tot_loss[loss=0.3029, simple_loss=0.3659, pruned_loss=0.12, over 4283834.50 frames. ], batch size: 414, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 19:49:14,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=210870.0, ans=0.125
2023-06-18 19:50:27,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=210990.0, ans=0.125
2023-06-18 19:50:37,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=211050.0, ans=0.015
2023-06-18 19:51:18,614 INFO [train.py:996] (0/4) Epoch 2, batch 4700, loss[loss=0.2611, simple_loss=0.3096, pruned_loss=0.1062, over 21326.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3552, pruned_loss=0.1172, over 4290048.05 frames. ], batch size: 144, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 19:51:19,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=211170.0, ans=0.125
2023-06-18 19:51:28,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=211170.0, ans=0.125
2023-06-18 19:51:33,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0
2023-06-18 19:51:37,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0
2023-06-18 19:53:02,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=211410.0, ans=0.0
2023-06-18 19:53:10,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=211410.0, ans=0.125
2023-06-18 19:53:12,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.814e+02 3.366e+02 4.246e+02 7.056e+02, threshold=6.733e+02, percent-clipped=0.0
2023-06-18 19:53:17,074 INFO [train.py:996] (0/4) Epoch 2, batch 4750, loss[loss=0.3445, simple_loss=0.367, pruned_loss=0.161, over 21405.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3489, pruned_loss=0.1166, over 4291421.05 frames. ], batch size: 473, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 19:54:52,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211650.0, ans=0.1
2023-06-18 19:55:31,823 INFO [train.py:996] (0/4) Epoch 2, batch 4800, loss[loss=0.295, simple_loss=0.3801, pruned_loss=0.1049, over 21704.00 frames. ], tot_loss[loss=0.2933, simple_loss=0.3513, pruned_loss=0.1176, over 4291391.36 frames. ], batch size: 389, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 19:56:41,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=211950.0, ans=0.0
2023-06-18 19:56:58,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=212010.0, ans=0.125
2023-06-18 19:57:14,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=212010.0, ans=0.125
2023-06-18 19:57:24,063 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.885e+02 3.329e+02 4.193e+02 7.094e+02, threshold=6.657e+02, percent-clipped=1.0
2023-06-18 19:57:28,263 INFO [train.py:996] (0/4) Epoch 2, batch 4850, loss[loss=0.3346, simple_loss=0.378, pruned_loss=0.1456, over 21756.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.348, pruned_loss=0.1165, over 4282872.14 frames. ], batch size: 441, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 19:57:51,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=212070.0, ans=0.0
2023-06-18 19:57:54,140 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 19:58:57,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=212250.0, ans=0.1
2023-06-18 19:59:17,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0
2023-06-18 19:59:36,626 INFO [train.py:996] (0/4) Epoch 2, batch 4900, loss[loss=0.3437, simple_loss=0.3925, pruned_loss=0.1474, over 21318.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.354, pruned_loss=0.1196, over 4280529.44 frames. ], batch size: 548, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 20:00:27,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=212430.0, ans=0.125
2023-06-18 20:00:37,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=212490.0, ans=0.0
2023-06-18 20:01:48,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.116e+02 3.732e+02 4.478e+02 7.040e+02, threshold=7.463e+02, percent-clipped=1.0
2023-06-18 20:01:53,123 INFO [train.py:996] (0/4) Epoch 2, batch 4950, loss[loss=0.2693, simple_loss=0.36, pruned_loss=0.08927, over 21618.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3536, pruned_loss=0.1152, over 4275908.64 frames. ], batch size: 389, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 20:03:03,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=212790.0, ans=0.2
2023-06-18 20:03:12,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=212790.0, ans=0.125
2023-06-18 20:03:59,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=212910.0, ans=0.125
2023-06-18 20:03:59,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=212910.0, ans=0.2
2023-06-18 20:04:13,541 INFO [train.py:996] (0/4) Epoch 2, batch 5000, loss[loss=0.3316, simple_loss=0.3808, pruned_loss=0.1412, over 21937.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3509, pruned_loss=0.111, over 4275115.78 frames. ], batch size: 113, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 20:05:14,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=213090.0, ans=0.07
2023-06-18 20:05:38,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=213150.0, ans=0.0
2023-06-18 20:05:42,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0
2023-06-18 20:05:43,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0
2023-06-18 20:06:09,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.689e+02 3.169e+02 3.752e+02 6.029e+02, threshold=6.337e+02, percent-clipped=0.0
2023-06-18 20:06:20,640 INFO [train.py:996] (0/4) Epoch 2, batch 5050, loss[loss=0.295, simple_loss=0.3472, pruned_loss=0.1214, over 21854.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3516, pruned_loss=0.115, over 4282445.02 frames. ], batch size: 118, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 20:07:09,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.19 vs. limit=15.0
2023-06-18 20:07:13,305 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.01 vs. limit=15.0
2023-06-18 20:07:17,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.40 vs. limit=5.0
2023-06-18 20:08:09,666 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.19 vs. limit=22.5
2023-06-18 20:08:32,004 INFO [train.py:996] (0/4) Epoch 2, batch 5100, loss[loss=0.2703, simple_loss=0.3304, pruned_loss=0.1052, over 21843.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3496, pruned_loss=0.1156, over 4285389.51 frames. ], batch size: 351, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 20:08:43,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=213570.0, ans=0.125
2023-06-18 20:09:22,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=213690.0, ans=0.0
2023-06-18 20:09:23,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0
2023-06-18 20:10:14,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=213750.0, ans=0.125
2023-06-18 20:10:26,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=213810.0, ans=0.125
2023-06-18 20:10:28,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.02 vs. limit=12.0
2023-06-18 20:10:32,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=213810.0, ans=0.1
2023-06-18 20:10:34,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.968e+02 3.545e+02 4.685e+02 7.236e+02, threshold=7.090e+02, percent-clipped=4.0
2023-06-18 20:10:43,528 INFO [train.py:996] (0/4) Epoch 2, batch 5150, loss[loss=0.2758, simple_loss=0.328, pruned_loss=0.1118, over 21892.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3473, pruned_loss=0.1159, over 4295915.77 frames. ], batch size: 316, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 20:11:05,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=213870.0, ans=0.125
2023-06-18 20:11:40,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=213990.0, ans=0.0
2023-06-18 20:12:04,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=213990.0, ans=0.0
2023-06-18 20:12:46,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.48 vs. limit=15.0
2023-06-18 20:13:01,832 INFO [train.py:996] (0/4) Epoch 2, batch 5200, loss[loss=0.3066, simple_loss=0.3901, pruned_loss=0.1115, over 21847.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.3465, pruned_loss=0.115, over 4297255.08 frames. ], batch size: 316, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 20:13:09,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0
2023-06-18 20:13:39,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=214230.0, ans=0.035
2023-06-18 20:13:54,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=214290.0, ans=0.0
2023-06-18 20:14:59,705 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.964e+02 3.541e+02 4.269e+02 6.155e+02, threshold=7.082e+02, percent-clipped=0.0
2023-06-18 20:15:00,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=214410.0, ans=0.125
2023-06-18 20:15:04,655 INFO [train.py:996] (0/4) Epoch 2, batch 5250, loss[loss=0.2456, simple_loss=0.3268, pruned_loss=0.08222, over 21389.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3479, pruned_loss=0.112, over 4289541.20 frames. ], batch size: 211, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 20:15:40,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=214470.0, ans=0.125
2023-06-18 20:16:05,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214530.0, ans=0.1
2023-06-18 20:16:37,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=214590.0, ans=0.0
2023-06-18 20:16:54,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=214650.0, ans=0.0
2023-06-18 20:17:18,946 INFO [train.py:996] (0/4) Epoch 2, batch 5300, loss[loss=0.3211, simple_loss=0.3654, pruned_loss=0.1384, over 21933.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3483, pruned_loss=0.1144, over 4295014.17 frames. ], batch size: 414, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 20:17:51,457 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5
2023-06-18 20:17:57,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=214830.0, ans=0.125
2023-06-18 20:18:35,596 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 20:19:14,024 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0
2023-06-18 20:19:17,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.854e+02 3.243e+02 3.685e+02 6.916e+02, threshold=6.486e+02, percent-clipped=0.0
2023-06-18 20:19:21,305 INFO [train.py:996] (0/4) Epoch 2, batch 5350, loss[loss=0.3011, simple_loss=0.3553, pruned_loss=0.1234, over 21780.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3476, pruned_loss=0.1154, over 4298602.90 frames. ], batch size: 112, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 20:19:41,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=215070.0, ans=0.2
2023-06-18 20:19:43,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=215070.0, ans=0.125
2023-06-18 20:20:26,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215190.0, ans=0.1
2023-06-18 20:21:41,611 INFO [train.py:996] (0/4) Epoch 2, batch 5400, loss[loss=0.2452, simple_loss=0.3137, pruned_loss=0.08831, over 21531.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3466, pruned_loss=0.1165, over 4299111.67 frames. ], batch size: 212, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 20:22:16,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=215430.0, ans=0.1
2023-06-18 20:22:55,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5
limit=22.5 2023-06-18 20:23:07,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=215490.0, ans=0.0 2023-06-18 20:23:43,971 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 2.870e+02 3.526e+02 4.433e+02 7.880e+02, threshold=7.051e+02, percent-clipped=2.0 2023-06-18 20:24:07,769 INFO [train.py:996] (0/4) Epoch 2, batch 5450, loss[loss=0.3012, simple_loss=0.3759, pruned_loss=0.1133, over 21854.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.349, pruned_loss=0.1142, over 4295102.37 frames. ], batch size: 371, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 20:24:11,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=215670.0, ans=0.0 2023-06-18 20:25:15,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.23 vs. limit=15.0 2023-06-18 20:25:22,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=215790.0, ans=0.2 2023-06-18 20:25:31,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=215850.0, ans=0.07 2023-06-18 20:25:44,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=215850.0, ans=22.5 2023-06-18 20:26:18,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=215910.0, ans=0.125 2023-06-18 20:26:21,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=215910.0, ans=0.0 2023-06-18 20:26:26,986 INFO [train.py:996] (0/4) Epoch 2, batch 5500, loss[loss=0.2226, simple_loss=0.3066, pruned_loss=0.06927, over 21178.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3528, pruned_loss=0.1098, over 4294219.62 frames. ], batch size: 176, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:26:28,180 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-18 20:26:33,309 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-36000.pt 2023-06-18 20:26:45,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2023-06-18 20:27:18,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=216090.0, ans=0.0 2023-06-18 20:27:30,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=216090.0, ans=0.0 2023-06-18 20:27:51,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=216150.0, ans=0.125 2023-06-18 20:28:51,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.788e+02 3.266e+02 4.093e+02 7.420e+02, threshold=6.532e+02, percent-clipped=2.0 2023-06-18 20:29:02,090 INFO [train.py:996] (0/4) Epoch 2, batch 5550, loss[loss=0.195, simple_loss=0.2765, pruned_loss=0.05676, over 21430.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3526, pruned_loss=0.1072, over 4293963.38 frames. 
], batch size: 194, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:29:41,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.06 vs. limit=15.0 2023-06-18 20:31:13,792 INFO [train.py:996] (0/4) Epoch 2, batch 5600, loss[loss=0.2743, simple_loss=0.3537, pruned_loss=0.09746, over 21398.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3469, pruned_loss=0.1023, over 4285918.45 frames. ], batch size: 211, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:31:25,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=216570.0, ans=0.125 2023-06-18 20:31:37,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216630.0, ans=0.1 2023-06-18 20:32:24,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=216690.0, ans=0.125 2023-06-18 20:32:35,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=216750.0, ans=0.125 2023-06-18 20:32:35,966 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:33:05,994 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 3.046e+02 3.574e+02 4.352e+02 8.183e+02, threshold=7.147e+02, percent-clipped=1.0 2023-06-18 20:33:20,520 INFO [train.py:996] (0/4) Epoch 2, batch 5650, loss[loss=0.2885, simple_loss=0.3402, pruned_loss=0.1184, over 21868.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3508, pruned_loss=0.1047, over 4282876.90 frames. ], batch size: 282, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:33:51,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-18 20:34:33,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=217050.0, ans=0.2 2023-06-18 20:34:46,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.04 vs. limit=15.0 2023-06-18 20:35:31,747 INFO [train.py:996] (0/4) Epoch 2, batch 5700, loss[loss=0.2609, simple_loss=0.3362, pruned_loss=0.09281, over 21726.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3525, pruned_loss=0.1083, over 4289115.78 frames. ], batch size: 298, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:35:58,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=217170.0, ans=0.125 2023-06-18 20:36:10,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=217230.0, ans=0.0 2023-06-18 20:36:12,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. 
limit=15.0 2023-06-18 20:36:20,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=217230.0, ans=0.5 2023-06-18 20:37:02,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=217290.0, ans=0.0 2023-06-18 20:37:51,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 3.053e+02 4.055e+02 5.336e+02 8.968e+02, threshold=8.109e+02, percent-clipped=6.0 2023-06-18 20:37:55,691 INFO [train.py:996] (0/4) Epoch 2, batch 5750, loss[loss=0.231, simple_loss=0.3153, pruned_loss=0.07334, over 21409.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3508, pruned_loss=0.1055, over 4287888.51 frames. ], batch size: 211, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:38:29,890 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=22.5 2023-06-18 20:38:49,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=217590.0, ans=0.2 2023-06-18 20:38:49,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=217590.0, ans=0.0 2023-06-18 20:39:11,447 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:40:27,367 INFO [train.py:996] (0/4) Epoch 2, batch 5800, loss[loss=0.3629, simple_loss=0.4324, pruned_loss=0.1467, over 21512.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3482, pruned_loss=0.104, over 4277246.77 frames. ], batch size: 471, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 20:42:37,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 2.376e+02 3.080e+02 4.315e+02 9.402e+02, threshold=6.161e+02, percent-clipped=2.0 2023-06-18 20:42:41,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=218070.0, ans=0.0 2023-06-18 20:42:41,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=218070.0, ans=0.125 2023-06-18 20:42:42,062 INFO [train.py:996] (0/4) Epoch 2, batch 5850, loss[loss=0.2804, simple_loss=0.3753, pruned_loss=0.09273, over 21248.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3455, pruned_loss=0.09915, over 4279839.78 frames. 
], batch size: 548, lr: 1.85e-02, grad_scale: 64.0 2023-06-18 20:43:07,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=218130.0, ans=0.125 2023-06-18 20:43:42,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=218190.0, ans=0.125 2023-06-18 20:44:21,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=218250.0, ans=0.125 2023-06-18 20:44:24,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=218310.0, ans=0.125 2023-06-18 20:44:38,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=218310.0, ans=0.125 2023-06-18 20:44:41,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=218310.0, ans=0.0 2023-06-18 20:44:52,892 INFO [train.py:996] (0/4) Epoch 2, batch 5900, loss[loss=0.2507, simple_loss=0.3452, pruned_loss=0.07808, over 21528.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3348, pruned_loss=0.0907, over 4276004.90 frames. ], batch size: 471, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 20:46:03,536 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-18 20:46:11,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=218550.0, ans=0.035 2023-06-18 20:46:12,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=218550.0, ans=0.125 2023-06-18 20:46:25,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-06-18 20:46:52,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 2.321e+02 3.028e+02 3.868e+02 8.968e+02, threshold=6.057e+02, percent-clipped=3.0 2023-06-18 20:46:53,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=218610.0, ans=0.125 2023-06-18 20:46:57,518 INFO [train.py:996] (0/4) Epoch 2, batch 5950, loss[loss=0.3153, simple_loss=0.3595, pruned_loss=0.1355, over 21882.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3368, pruned_loss=0.09591, over 4282052.12 frames. ], batch size: 351, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 20:47:07,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=218670.0, ans=0.125 2023-06-18 20:47:48,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=218790.0, ans=0.5 2023-06-18 20:48:52,201 INFO [train.py:996] (0/4) Epoch 2, batch 6000, loss[loss=0.264, simple_loss=0.305, pruned_loss=0.1116, over 20174.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3347, pruned_loss=0.101, over 4263938.74 frames. ], batch size: 702, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 20:48:52,203 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 20:49:47,779 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2855, simple_loss=0.3796, pruned_loss=0.09574, over 1796401.00 frames. 
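Note: the per-batch and validation loss columns above are internally consistent with a fixed weighting loss = 0.5 * simple_loss + pruned_loss; e.g. the validation entry just logged gives 0.5 * 0.3796 + 0.09574 = 0.2855. A minimal sketch of that weighting (the function name and the 0.5 scale are inferred from the logged numbers themselves, not read from the training script):

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        """Weighted total matching the log's `loss` column.

        The 0.5 scale is inferred from the logged numbers, not taken
        from the training code, so treat it as an assumption.
        """
        return simple_loss_scale * simple_loss + pruned_loss

    # Validation entry above: 0.5 * 0.3796 + 0.09574 rounds to 0.2855.
    assert abs(combined_loss(0.3796, 0.09574) - 0.2855) < 5e-4

The same check holds for the per-batch lines (e.g. batch 6400: 0.5 * 0.357 + 0.04509 = 0.2236), so the relation appears stable across the run.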
2023-06-18 20:49:47,782 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-18 20:50:57,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=219150.0, ans=0.125 2023-06-18 20:51:00,074 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:51:17,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=219210.0, ans=0.5 2023-06-18 20:51:42,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.988e+02 3.378e+02 4.160e+02 8.273e+02, threshold=6.755e+02, percent-clipped=6.0 2023-06-18 20:51:45,876 INFO [train.py:996] (0/4) Epoch 2, batch 6050, loss[loss=0.2378, simple_loss=0.2972, pruned_loss=0.08924, over 21652.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3306, pruned_loss=0.1025, over 4267502.03 frames. ], batch size: 298, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 20:52:29,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. limit=10.0 2023-06-18 20:52:38,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=219390.0, ans=0.035 2023-06-18 20:52:45,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=219390.0, ans=0.125 2023-06-18 20:52:49,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=219390.0, ans=0.04949747468305833 2023-06-18 20:53:19,329 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:53:44,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=219570.0, ans=0.0 2023-06-18 20:53:45,913 INFO [train.py:996] (0/4) Epoch 2, batch 6100, loss[loss=0.2781, simple_loss=0.334, pruned_loss=0.1111, over 21489.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3281, pruned_loss=0.1002, over 4263279.67 frames. ], batch size: 548, lr: 1.84e-02, grad_scale: 16.0 2023-06-18 20:54:13,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=22.5 2023-06-18 20:54:26,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=219630.0, ans=0.125 2023-06-18 20:54:45,093 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-18 20:54:52,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=219690.0, ans=0.0 2023-06-18 20:55:09,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=219750.0, ans=0.125 2023-06-18 20:55:20,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.06 vs. 
limit=8.0 2023-06-18 20:55:54,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 3.086e+02 4.066e+02 5.102e+02 8.027e+02, threshold=8.133e+02, percent-clipped=8.0 2023-06-18 20:55:56,250 INFO [train.py:996] (0/4) Epoch 2, batch 6150, loss[loss=0.3641, simple_loss=0.3856, pruned_loss=0.1713, over 21420.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.334, pruned_loss=0.1058, over 4266038.12 frames. ], batch size: 507, lr: 1.84e-02, grad_scale: 16.0 2023-06-18 20:55:58,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.04 vs. limit=10.0 2023-06-18 20:56:05,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=219870.0, ans=0.125 2023-06-18 20:56:13,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.09 vs. limit=15.0 2023-06-18 20:56:28,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=22.5 2023-06-18 20:57:07,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=219990.0, ans=0.0 2023-06-18 20:58:07,513 INFO [train.py:996] (0/4) Epoch 2, batch 6200, loss[loss=0.2889, simple_loss=0.3537, pruned_loss=0.112, over 21855.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3368, pruned_loss=0.1065, over 4262650.25 frames. ], batch size: 316, lr: 1.84e-02, grad_scale: 16.0 2023-06-18 20:58:30,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=220230.0, ans=0.0 2023-06-18 20:59:25,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=220290.0, ans=0.125 2023-06-18 21:00:27,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.590e+02 3.175e+02 3.801e+02 7.451e+02, threshold=6.350e+02, percent-clipped=0.0 2023-06-18 21:00:28,868 INFO [train.py:996] (0/4) Epoch 2, batch 6250, loss[loss=0.3829, simple_loss=0.4412, pruned_loss=0.1623, over 21448.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3416, pruned_loss=0.1064, over 4267551.69 frames. ], batch size: 507, lr: 1.84e-02, grad_scale: 16.0 2023-06-18 21:00:29,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=220470.0, ans=0.2 2023-06-18 21:00:37,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=220470.0, ans=0.125 2023-06-18 21:00:46,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220530.0, ans=0.1 2023-06-18 21:02:15,014 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:02:39,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=220710.0, ans=0.125 2023-06-18 21:02:43,590 INFO [train.py:996] (0/4) Epoch 2, batch 6300, loss[loss=0.3169, simple_loss=0.3632, pruned_loss=0.1353, over 21743.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.345, pruned_loss=0.1054, over 4278900.39 frames. 
], batch size: 112, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:03:02,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=220830.0, ans=0.125 2023-06-18 21:03:53,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-18 21:04:33,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-18 21:04:34,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=221010.0, ans=0.1 2023-06-18 21:04:35,981 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=15.0 2023-06-18 21:04:39,187 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.847e+02 3.430e+02 4.458e+02 8.729e+02, threshold=6.860e+02, percent-clipped=4.0 2023-06-18 21:04:40,736 INFO [train.py:996] (0/4) Epoch 2, batch 6350, loss[loss=0.3435, simple_loss=0.3873, pruned_loss=0.1498, over 21943.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3514, pruned_loss=0.1113, over 4284787.31 frames. ], batch size: 372, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:05:00,739 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:05:43,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=221130.0, ans=0.0 2023-06-18 21:06:21,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=221250.0, ans=0.2 2023-06-18 21:06:32,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=221250.0, ans=0.0 2023-06-18 21:06:43,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=221310.0, ans=0.125 2023-06-18 21:07:00,534 INFO [train.py:996] (0/4) Epoch 2, batch 6400, loss[loss=0.2236, simple_loss=0.357, pruned_loss=0.04509, over 20801.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3562, pruned_loss=0.1145, over 4277030.88 frames. ], batch size: 607, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 21:07:40,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-18 21:07:59,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-06-18 21:08:16,912 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.95 vs. 
limit=6.0 2023-06-18 21:09:08,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=221610.0, ans=0.0 2023-06-18 21:09:22,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.900e+02 3.291e+02 3.845e+02 6.146e+02, threshold=6.583e+02, percent-clipped=0.0 2023-06-18 21:09:22,197 INFO [train.py:996] (0/4) Epoch 2, batch 6450, loss[loss=0.2622, simple_loss=0.3472, pruned_loss=0.08858, over 21746.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3583, pruned_loss=0.1147, over 4273767.35 frames. ], batch size: 332, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:09:46,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=221670.0, ans=0.125 2023-06-18 21:10:46,152 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-06-18 21:11:10,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=221910.0, ans=0.125 2023-06-18 21:11:10,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=221910.0, ans=0.125 2023-06-18 21:11:21,747 INFO [train.py:996] (0/4) Epoch 2, batch 6500, loss[loss=0.2715, simple_loss=0.3192, pruned_loss=0.1119, over 21764.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3487, pruned_loss=0.1124, over 4266272.88 frames. ], batch size: 124, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:11:31,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=221970.0, ans=0.125 2023-06-18 21:12:29,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=222090.0, ans=0.125 2023-06-18 21:12:39,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=222150.0, ans=0.125 2023-06-18 21:12:48,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2023-06-18 21:13:01,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=222150.0, ans=0.5 2023-06-18 21:13:40,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.666e+02 3.386e+02 4.253e+02 7.478e+02, threshold=6.772e+02, percent-clipped=2.0 2023-06-18 21:13:40,692 INFO [train.py:996] (0/4) Epoch 2, batch 6550, loss[loss=0.2495, simple_loss=0.2812, pruned_loss=0.1089, over 20986.00 frames. ], tot_loss[loss=0.283, simple_loss=0.344, pruned_loss=0.1109, over 4268797.72 frames. 
], batch size: 613, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:13:45,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=222270.0, ans=0.0 2023-06-18 21:14:14,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222330.0, ans=0.1 2023-06-18 21:14:26,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=222330.0, ans=0.5 2023-06-18 21:14:31,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=222390.0, ans=0.125 2023-06-18 21:15:33,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=222510.0, ans=0.125 2023-06-18 21:15:40,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=222510.0, ans=0.0 2023-06-18 21:15:44,898 INFO [train.py:996] (0/4) Epoch 2, batch 6600, loss[loss=0.2229, simple_loss=0.2771, pruned_loss=0.0843, over 21555.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.3402, pruned_loss=0.111, over 4268482.68 frames. ], batch size: 230, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:15:45,331 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:16:50,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=222690.0, ans=0.0 2023-06-18 21:16:52,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=222750.0, ans=0.125 2023-06-18 21:17:30,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-18 21:17:42,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.521e+02 3.069e+02 3.605e+02 6.340e+02, threshold=6.138e+02, percent-clipped=0.0 2023-06-18 21:17:42,557 INFO [train.py:996] (0/4) Epoch 2, batch 6650, loss[loss=0.2439, simple_loss=0.3072, pruned_loss=0.09036, over 21808.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3324, pruned_loss=0.1067, over 4277679.28 frames. ], batch size: 352, lr: 1.83e-02, grad_scale: 16.0 2023-06-18 21:18:12,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=222870.0, ans=0.125 2023-06-18 21:19:12,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=223050.0, ans=0.125 2023-06-18 21:19:42,827 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:19:53,887 INFO [train.py:996] (0/4) Epoch 2, batch 6700, loss[loss=0.2836, simple_loss=0.3434, pruned_loss=0.1119, over 21740.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3297, pruned_loss=0.1069, over 4280419.16 frames. 
], batch size: 333, lr: 1.82e-02, grad_scale: 16.0 2023-06-18 21:19:57,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=223170.0, ans=0.125 2023-06-18 21:20:20,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=223230.0, ans=0.125 2023-06-18 21:20:22,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=15.0 2023-06-18 21:20:58,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=223290.0, ans=0.125 2023-06-18 21:21:04,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=223350.0, ans=0.125 2023-06-18 21:21:24,993 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.85 vs. limit=15.0 2023-06-18 21:21:25,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=223350.0, ans=0.125 2023-06-18 21:21:58,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=223470.0, ans=0.1 2023-06-18 21:21:59,456 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-06-18 21:21:59,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.168e+02 3.639e+02 4.397e+02 6.633e+02, threshold=7.278e+02, percent-clipped=4.0 2023-06-18 21:21:59,681 INFO [train.py:996] (0/4) Epoch 2, batch 6750, loss[loss=0.3372, simple_loss=0.3522, pruned_loss=0.1611, over 21531.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.33, pruned_loss=0.1083, over 4281051.76 frames. ], batch size: 508, lr: 1.82e-02, grad_scale: 16.0 2023-06-18 21:22:23,705 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-18 21:22:35,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=223530.0, ans=0.2 2023-06-18 21:23:26,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=223650.0, ans=0.125 2023-06-18 21:24:15,893 INFO [train.py:996] (0/4) Epoch 2, batch 6800, loss[loss=0.2574, simple_loss=0.3126, pruned_loss=0.1011, over 21659.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3327, pruned_loss=0.1108, over 4290514.86 frames. ], batch size: 247, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:25:00,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=223890.0, ans=0.0 2023-06-18 21:25:46,875 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:26:06,243 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.622e+02 3.115e+02 3.857e+02 6.286e+02, threshold=6.230e+02, percent-clipped=0.0 2023-06-18 21:26:06,266 INFO [train.py:996] (0/4) Epoch 2, batch 6850, loss[loss=0.2924, simple_loss=0.3355, pruned_loss=0.1246, over 21861.00 frames. 
], tot_loss[loss=0.2775, simple_loss=0.3306, pruned_loss=0.1122, over 4295340.14 frames. ], batch size: 351, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:26:10,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=12.0 2023-06-18 21:26:45,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224130.0, ans=0.1 2023-06-18 21:27:42,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=224250.0, ans=0.125 2023-06-18 21:28:11,350 INFO [train.py:996] (0/4) Epoch 2, batch 6900, loss[loss=0.2252, simple_loss=0.3072, pruned_loss=0.07164, over 21770.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3314, pruned_loss=0.1125, over 4298327.45 frames. ], batch size: 247, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:28:44,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224370.0, ans=0.1 2023-06-18 21:30:44,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=224610.0, ans=0.125 2023-06-18 21:30:47,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.868e+02 2.531e+02 2.951e+02 3.498e+02 5.383e+02, threshold=5.901e+02, percent-clipped=0.0 2023-06-18 21:30:47,639 INFO [train.py:996] (0/4) Epoch 2, batch 6950, loss[loss=0.2218, simple_loss=0.3094, pruned_loss=0.06706, over 21641.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3325, pruned_loss=0.1088, over 4297806.58 frames. ], batch size: 263, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:31:19,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224730.0, ans=0.1 2023-06-18 21:31:22,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=224790.0, ans=0.05 2023-06-18 21:31:41,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224850.0, ans=0.1 2023-06-18 21:32:21,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=224850.0, ans=15.0 2023-06-18 21:32:49,129 INFO [train.py:996] (0/4) Epoch 2, batch 7000, loss[loss=0.2482, simple_loss=0.291, pruned_loss=0.1027, over 21206.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3364, pruned_loss=0.1123, over 4288549.09 frames. ], batch size: 176, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:32:56,143 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-18 21:33:47,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=225090.0, ans=0.0 2023-06-18 21:34:24,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=225210.0, ans=0.035 2023-06-18 21:34:25,493 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.92 vs. 
limit=15.0 2023-06-18 21:35:00,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.871e+02 3.500e+02 4.345e+02 7.832e+02, threshold=7.000e+02, percent-clipped=4.0 2023-06-18 21:35:00,474 INFO [train.py:996] (0/4) Epoch 2, batch 7050, loss[loss=0.2132, simple_loss=0.2618, pruned_loss=0.08231, over 16072.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3332, pruned_loss=0.1101, over 4276767.40 frames. ], batch size: 60, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:35:03,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=225270.0, ans=0.0 2023-06-18 21:35:52,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2023-06-18 21:35:57,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=225390.0, ans=0.0 2023-06-18 21:36:00,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-06-18 21:36:33,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=225450.0, ans=0.0 2023-06-18 21:37:13,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=225570.0, ans=0.07 2023-06-18 21:37:14,080 INFO [train.py:996] (0/4) Epoch 2, batch 7100, loss[loss=0.2596, simple_loss=0.3326, pruned_loss=0.0933, over 21791.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3391, pruned_loss=0.1123, over 4280014.93 frames. ], batch size: 282, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 21:38:58,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=225750.0, ans=0.125 2023-06-18 21:39:17,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.803e+02 3.580e+02 4.596e+02 1.041e+03, threshold=7.161e+02, percent-clipped=7.0 2023-06-18 21:39:17,196 INFO [train.py:996] (0/4) Epoch 2, batch 7150, loss[loss=0.3056, simple_loss=0.365, pruned_loss=0.1231, over 21407.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3364, pruned_loss=0.109, over 4274561.23 frames. ], batch size: 549, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:39:19,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=225870.0, ans=0.2 2023-06-18 21:40:10,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=225990.0, ans=0.125 2023-06-18 21:40:10,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=225990.0, ans=0.0 2023-06-18 21:41:05,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=226110.0, ans=0.07 2023-06-18 21:41:23,734 INFO [train.py:996] (0/4) Epoch 2, batch 7200, loss[loss=0.3322, simple_loss=0.3433, pruned_loss=0.1605, over 21456.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3414, pruned_loss=0.1137, over 4273760.36 frames. 
], batch size: 510, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:41:25,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=226170.0, ans=0.125 2023-06-18 21:41:55,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=22.5 2023-06-18 21:42:08,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=226230.0, ans=0.1 2023-06-18 21:42:26,990 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-06-18 21:43:32,291 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.330e+02 3.074e+02 3.664e+02 4.641e+02 7.244e+02, threshold=7.329e+02, percent-clipped=2.0 2023-06-18 21:43:32,323 INFO [train.py:996] (0/4) Epoch 2, batch 7250, loss[loss=0.2722, simple_loss=0.3246, pruned_loss=0.1099, over 21744.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3356, pruned_loss=0.1129, over 4272860.70 frames. ], batch size: 112, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:44:34,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=226590.0, ans=0.125 2023-06-18 21:44:52,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-18 21:45:26,282 INFO [train.py:996] (0/4) Epoch 2, batch 7300, loss[loss=0.2412, simple_loss=0.2957, pruned_loss=0.09339, over 21817.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3299, pruned_loss=0.1119, over 4264249.92 frames. ], batch size: 318, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:46:26,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=226890.0, ans=0.125 2023-06-18 21:46:55,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=226950.0, ans=0.125 2023-06-18 21:46:55,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-18 21:46:57,799 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.93 vs. limit=15.0 2023-06-18 21:47:36,826 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.896e+02 3.515e+02 4.271e+02 6.710e+02, threshold=7.030e+02, percent-clipped=0.0 2023-06-18 21:47:36,850 INFO [train.py:996] (0/4) Epoch 2, batch 7350, loss[loss=0.3259, simple_loss=0.3745, pruned_loss=0.1386, over 21400.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.328, pruned_loss=0.1125, over 4253945.18 frames. ], batch size: 131, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:48:13,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. 
limit=15.0 2023-06-18 21:48:58,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=227250.0, ans=0.125 2023-06-18 21:49:05,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227250.0, ans=0.1 2023-06-18 21:49:49,001 INFO [train.py:996] (0/4) Epoch 2, batch 7400, loss[loss=0.2672, simple_loss=0.3515, pruned_loss=0.09147, over 21692.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3362, pruned_loss=0.1152, over 4257709.92 frames. ], batch size: 351, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:50:12,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.08 vs. limit=10.0 2023-06-18 21:51:18,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-18 21:51:46,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=227610.0, ans=0.125 2023-06-18 21:52:10,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.891e+02 3.433e+02 4.412e+02 7.257e+02, threshold=6.867e+02, percent-clipped=1.0 2023-06-18 21:52:10,505 INFO [train.py:996] (0/4) Epoch 2, batch 7450, loss[loss=0.2827, simple_loss=0.322, pruned_loss=0.1217, over 21248.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.334, pruned_loss=0.1123, over 4264208.45 frames. ], batch size: 159, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:52:28,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.19 vs. limit=22.5 2023-06-18 21:52:46,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227730.0, ans=0.1 2023-06-18 21:54:12,224 INFO [train.py:996] (0/4) Epoch 2, batch 7500, loss[loss=0.3172, simple_loss=0.3897, pruned_loss=0.1224, over 21742.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3407, pruned_loss=0.1149, over 4261405.16 frames. ], batch size: 351, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:54:49,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=228030.0, ans=10.0 2023-06-18 21:55:58,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=228150.0, ans=0.2 2023-06-18 21:56:06,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228150.0, ans=0.1 2023-06-18 21:56:19,756 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.10 vs. limit=12.0 2023-06-18 21:56:33,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.064e+02 3.636e+02 4.254e+02 7.923e+02, threshold=7.272e+02, percent-clipped=2.0 2023-06-18 21:56:33,971 INFO [train.py:996] (0/4) Epoch 2, batch 7550, loss[loss=0.2853, simple_loss=0.3648, pruned_loss=0.1029, over 20669.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.349, pruned_loss=0.1136, over 4269881.36 frames. 
], batch size: 608, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 21:57:17,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-18 21:58:14,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=228450.0, ans=0.0 2023-06-18 21:58:28,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-18 21:58:30,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=228510.0, ans=0.125 2023-06-18 21:58:51,174 INFO [train.py:996] (0/4) Epoch 2, batch 7600, loss[loss=0.2852, simple_loss=0.3368, pruned_loss=0.1168, over 21297.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3475, pruned_loss=0.1122, over 4276242.83 frames. ], batch size: 159, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 21:59:00,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=228570.0, ans=0.2 2023-06-18 21:59:55,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=228690.0, ans=0.125 2023-06-18 22:00:33,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=228750.0, ans=0.125 2023-06-18 22:01:06,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.201e+02 4.085e+02 5.282e+02 1.069e+03, threshold=8.169e+02, percent-clipped=6.0 2023-06-18 22:01:06,636 INFO [train.py:996] (0/4) Epoch 2, batch 7650, loss[loss=0.2975, simple_loss=0.342, pruned_loss=0.1265, over 21932.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.347, pruned_loss=0.1146, over 4283032.89 frames. ], batch size: 351, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:01:22,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-18 22:01:30,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=228930.0, ans=0.125 2023-06-18 22:03:12,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=229110.0, ans=0.125 2023-06-18 22:03:23,246 INFO [train.py:996] (0/4) Epoch 2, batch 7700, loss[loss=0.2969, simple_loss=0.3492, pruned_loss=0.1223, over 21642.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3493, pruned_loss=0.1179, over 4284058.54 frames. 
], batch size: 263, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:04:11,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=229290.0, ans=0.125 2023-06-18 22:04:44,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=229350.0, ans=0.125 2023-06-18 22:05:10,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=229410.0, ans=0.0 2023-06-18 22:05:29,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=229410.0, ans=0.0 2023-06-18 22:05:39,895 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 2.910e+02 3.418e+02 4.204e+02 6.924e+02, threshold=6.836e+02, percent-clipped=0.0 2023-06-18 22:05:39,918 INFO [train.py:996] (0/4) Epoch 2, batch 7750, loss[loss=0.4937, simple_loss=0.5401, pruned_loss=0.2236, over 21402.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3556, pruned_loss=0.1183, over 4284387.94 frames. ], batch size: 507, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:05:47,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=229470.0, ans=0.0 2023-06-18 22:06:15,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-18 22:07:11,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=229650.0, ans=0.0 2023-06-18 22:07:38,205 INFO [train.py:996] (0/4) Epoch 2, batch 7800, loss[loss=0.2452, simple_loss=0.3063, pruned_loss=0.0921, over 21623.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3565, pruned_loss=0.1175, over 4276700.98 frames. ], batch size: 263, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:08:49,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-18 22:09:12,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=12.0 2023-06-18 22:09:36,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=230010.0, ans=0.1 2023-06-18 22:09:46,308 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.917e+02 3.531e+02 4.288e+02 7.064e+02, threshold=7.063e+02, percent-clipped=1.0 2023-06-18 22:09:46,332 INFO [train.py:996] (0/4) Epoch 2, batch 7850, loss[loss=0.2783, simple_loss=0.3245, pruned_loss=0.116, over 21845.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3483, pruned_loss=0.1155, over 4269958.02 frames. ], batch size: 373, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:09:51,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=22.5 2023-06-18 22:09:54,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=230070.0, ans=0.0 2023-06-18 22:11:52,268 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.37 vs. 
limit=22.5 2023-06-18 22:11:56,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=230310.0, ans=0.1 2023-06-18 22:12:06,690 INFO [train.py:996] (0/4) Epoch 2, batch 7900, loss[loss=0.3875, simple_loss=0.4496, pruned_loss=0.1627, over 21586.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3444, pruned_loss=0.1149, over 4257825.99 frames. ], batch size: 441, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:14:33,511 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.975e+02 3.296e+02 3.925e+02 7.503e+02, threshold=6.593e+02, percent-clipped=2.0 2023-06-18 22:14:33,535 INFO [train.py:996] (0/4) Epoch 2, batch 7950, loss[loss=0.2666, simple_loss=0.3466, pruned_loss=0.09326, over 21642.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3518, pruned_loss=0.1155, over 4257851.85 frames. ], batch size: 263, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:14:35,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=230670.0, ans=0.2 2023-06-18 22:15:26,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=230730.0, ans=0.125 2023-06-18 22:16:10,350 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.45 vs. limit=22.5 2023-06-18 22:16:56,568 INFO [train.py:996] (0/4) Epoch 2, batch 8000, loss[loss=0.2886, simple_loss=0.3498, pruned_loss=0.1137, over 21332.00 frames. ], tot_loss[loss=0.298, simple_loss=0.3575, pruned_loss=0.1192, over 4265088.88 frames. ], batch size: 176, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 22:17:00,833 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.60 vs. limit=8.0 2023-06-18 22:18:58,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=231150.0, ans=0.125 2023-06-18 22:19:03,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=231210.0, ans=0.125 2023-06-18 22:19:34,457 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.162e+02 3.821e+02 4.807e+02 7.380e+02, threshold=7.642e+02, percent-clipped=7.0 2023-06-18 22:19:34,481 INFO [train.py:996] (0/4) Epoch 2, batch 8050, loss[loss=0.2893, simple_loss=0.3315, pruned_loss=0.1236, over 20259.00 frames. ], tot_loss[loss=0.3011, simple_loss=0.362, pruned_loss=0.12, over 4263712.49 frames. ], batch size: 702, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 22:19:37,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=231270.0, ans=0.125 2023-06-18 22:20:35,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=231390.0, ans=0.125 2023-06-18 22:21:11,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=231450.0, ans=0.07 2023-06-18 22:21:49,803 INFO [train.py:996] (0/4) Epoch 2, batch 8100, loss[loss=0.3023, simple_loss=0.3606, pruned_loss=0.122, over 21855.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3585, pruned_loss=0.1197, over 4271486.60 frames. 
], batch size: 118, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 22:22:27,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.27 vs. limit=22.5 2023-06-18 22:22:34,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=231630.0, ans=0.125 2023-06-18 22:23:18,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-18 22:24:01,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=231750.0, ans=0.125 2023-06-18 22:24:07,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=231810.0, ans=0.07 2023-06-18 22:24:08,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=231810.0, ans=0.125 2023-06-18 22:24:27,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-18 22:24:27,849 INFO [train.py:996] (0/4) Epoch 2, batch 8150, loss[loss=0.3805, simple_loss=0.4514, pruned_loss=0.1548, over 21513.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3632, pruned_loss=0.1203, over 4268744.39 frames. ], batch size: 507, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:24:34,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.988e+02 3.821e+02 5.220e+02 8.604e+02, threshold=7.643e+02, percent-clipped=3.0 2023-06-18 22:25:19,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=231930.0, ans=0.125 2023-06-18 22:25:19,636 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-18 22:25:45,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231990.0, ans=0.1 2023-06-18 22:25:58,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232050.0, ans=0.1 2023-06-18 22:26:02,912 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=12.0 2023-06-18 22:26:26,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=232110.0, ans=0.2 2023-06-18 22:26:31,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.66 vs. limit=6.0 2023-06-18 22:26:44,096 INFO [train.py:996] (0/4) Epoch 2, batch 8200, loss[loss=0.2498, simple_loss=0.2963, pruned_loss=0.1017, over 21569.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3562, pruned_loss=0.1161, over 4266646.94 frames. 
], batch size: 263, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:27:18,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=232230.0, ans=0.2 2023-06-18 22:27:48,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=232290.0, ans=0.125 2023-06-18 22:28:06,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=22.5 2023-06-18 22:28:11,302 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:28:32,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=232410.0, ans=0.125 2023-06-18 22:28:36,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=232410.0, ans=0.0 2023-06-18 22:28:43,271 INFO [train.py:996] (0/4) Epoch 2, batch 8250, loss[loss=0.2663, simple_loss=0.3156, pruned_loss=0.1085, over 21996.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3548, pruned_loss=0.1158, over 4267636.03 frames. ], batch size: 103, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:28:44,373 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-18 22:28:44,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.255e+02 3.894e+02 4.730e+02 9.572e+02, threshold=7.788e+02, percent-clipped=3.0 2023-06-18 22:29:59,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=232590.0, ans=0.125 2023-06-18 22:30:46,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=232710.0, ans=0.0 2023-06-18 22:31:06,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232770.0, ans=0.1 2023-06-18 22:31:07,385 INFO [train.py:996] (0/4) Epoch 2, batch 8300, loss[loss=0.345, simple_loss=0.3997, pruned_loss=0.1452, over 21607.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3513, pruned_loss=0.1122, over 4266529.29 frames. ], batch size: 441, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:31:26,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=232770.0, ans=0.0 2023-06-18 22:32:35,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=232950.0, ans=0.95 2023-06-18 22:32:59,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=233010.0, ans=0.2 2023-06-18 22:33:23,367 INFO [train.py:996] (0/4) Epoch 2, batch 8350, loss[loss=0.2831, simple_loss=0.3453, pruned_loss=0.1104, over 21520.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3499, pruned_loss=0.1099, over 4273024.35 frames. 
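
The recurring scaling.py:182 ScheduledFloat entries (dropout_p, skip_rate, prob, min_abs, ...) are module parameters whose value is a function of batch_count. Below is a sketch under the assumption of piecewise-linear interpolation between (batch_count, value) breakpoints; the class is hypothetical, not the scaling.py original.

```python
# Minimal sketch of a batch-count-keyed float schedule, in the spirit of the
# "ScheduledFloat: name=..., batch_count=..., ans=..." log lines. Hypothetical.
class ScheduledFloatSketch:
    def __init__(self, *points):
        # points: (batch_count, value) breakpoints
        self.points = sorted(points)

    def value(self, batch_count):
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)


# e.g. a dropout that decays from 0.3 to 0.1 over the first 20k batches
# (breakpoints chosen for illustration only):
dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value(230310.0))  # -> 0.1
```

By this point in training (batch_count well past any plausible final breakpoint), such schedules would have settled at their terminal constants, which would explain why the same ans values (0.1, 0.125, 0.2, ...) repeat throughout these entries.
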
], batch size: 389, lr: 1.79e-02, grad_scale: 16.0 2023-06-18 22:33:30,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.718e+02 3.193e+02 3.748e+02 6.520e+02, threshold=6.386e+02, percent-clipped=0.0 2023-06-18 22:34:27,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=233190.0, ans=0.125 2023-06-18 22:34:28,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.93 vs. limit=10.0 2023-06-18 22:34:56,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=233250.0, ans=0.05 2023-06-18 22:35:52,760 INFO [train.py:996] (0/4) Epoch 2, batch 8400, loss[loss=0.2592, simple_loss=0.3482, pruned_loss=0.08507, over 21227.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3449, pruned_loss=0.1049, over 4278273.96 frames. ], batch size: 548, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 22:36:16,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=233430.0, ans=0.0 2023-06-18 22:36:29,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=233430.0, ans=0.125 2023-06-18 22:38:00,539 INFO [train.py:996] (0/4) Epoch 2, batch 8450, loss[loss=0.3069, simple_loss=0.3515, pruned_loss=0.1311, over 21751.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3438, pruned_loss=0.1065, over 4282072.47 frames. ], batch size: 389, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:38:01,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2023-06-18 22:38:02,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.629e+02 3.179e+02 4.022e+02 7.095e+02, threshold=6.359e+02, percent-clipped=3.0 2023-06-18 22:38:06,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=233670.0, ans=0.2 2023-06-18 22:39:01,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=233790.0, ans=0.0 2023-06-18 22:39:23,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=233850.0, ans=0.09899494936611666 2023-06-18 22:39:27,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=233910.0, ans=0.125 2023-06-18 22:39:55,619 INFO [train.py:996] (0/4) Epoch 2, batch 8500, loss[loss=0.3082, simple_loss=0.416, pruned_loss=0.1002, over 20731.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3413, pruned_loss=0.1088, over 4275615.53 frames. ], batch size: 607, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:39:56,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=233970.0, ans=0.2 2023-06-18 22:40:01,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=233970.0, ans=0.125 2023-06-18 22:40:04,972 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.14 vs. 
limit=15.0 2023-06-18 22:40:42,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-18 22:41:04,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=234090.0, ans=0.125 2023-06-18 22:42:16,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=234270.0, ans=0.07 2023-06-18 22:42:28,183 INFO [train.py:996] (0/4) Epoch 2, batch 8550, loss[loss=0.2646, simple_loss=0.3345, pruned_loss=0.09736, over 21289.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3469, pruned_loss=0.113, over 4275537.19 frames. ], batch size: 159, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:42:29,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.313e+02 3.994e+02 5.022e+02 7.456e+02, threshold=7.988e+02, percent-clipped=4.0 2023-06-18 22:43:00,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=234330.0, ans=0.125 2023-06-18 22:43:04,011 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:43:36,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=234390.0, ans=0.015 2023-06-18 22:43:40,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.42 vs. limit=10.0 2023-06-18 22:44:41,773 INFO [train.py:996] (0/4) Epoch 2, batch 8600, loss[loss=0.3817, simple_loss=0.4289, pruned_loss=0.1673, over 21290.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.3529, pruned_loss=0.1154, over 4279345.67 frames. ], batch size: 143, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:44:42,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234570.0, ans=0.1 2023-06-18 22:44:43,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=234570.0, ans=10.0 2023-06-18 22:45:20,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=234630.0, ans=0.2 2023-06-18 22:45:24,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=234630.0, ans=0.0 2023-06-18 22:45:43,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=22.5 2023-06-18 22:46:36,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=234810.0, ans=0.125 2023-06-18 22:47:07,831 INFO [train.py:996] (0/4) Epoch 2, batch 8650, loss[loss=0.3152, simple_loss=0.3792, pruned_loss=0.1256, over 21462.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.359, pruned_loss=0.1156, over 4268327.31 frames. 
], batch size: 211, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:47:14,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 3.220e+02 3.691e+02 4.621e+02 7.023e+02, threshold=7.382e+02, percent-clipped=0.0 2023-06-18 22:47:33,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=234930.0, ans=0.125 2023-06-18 22:47:34,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=234930.0, ans=0.02 2023-06-18 22:47:58,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=234990.0, ans=0.0 2023-06-18 22:48:49,042 INFO [train.py:996] (0/4) Epoch 2, batch 8700, loss[loss=0.2681, simple_loss=0.3167, pruned_loss=0.1097, over 21669.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3495, pruned_loss=0.1102, over 4261873.34 frames. ], batch size: 333, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:51:10,602 INFO [train.py:996] (0/4) Epoch 2, batch 8750, loss[loss=0.2828, simple_loss=0.3296, pruned_loss=0.118, over 21576.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3445, pruned_loss=0.1107, over 4267665.37 frames. ], batch size: 391, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:51:12,169 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.975e+02 3.407e+02 4.208e+02 7.792e+02, threshold=6.814e+02, percent-clipped=1.0 2023-06-18 22:51:43,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=235530.0, ans=0.1 2023-06-18 22:52:33,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=235650.0, ans=0.0 2023-06-18 22:52:43,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. limit=15.0 2023-06-18 22:53:34,727 INFO [train.py:996] (0/4) Epoch 2, batch 8800, loss[loss=0.2634, simple_loss=0.37, pruned_loss=0.0784, over 19761.00 frames. ], tot_loss[loss=0.2909, simple_loss=0.3534, pruned_loss=0.1142, over 4272220.75 frames. ], batch size: 702, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:54:26,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=235890.0, ans=0.0 2023-06-18 22:55:52,236 INFO [train.py:996] (0/4) Epoch 2, batch 8850, loss[loss=0.2982, simple_loss=0.3571, pruned_loss=0.1197, over 21462.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3604, pruned_loss=0.1163, over 4268856.67 frames. ], batch size: 389, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:55:53,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.077e+02 3.659e+02 4.478e+02 7.244e+02, threshold=7.318e+02, percent-clipped=2.0 2023-06-18 22:56:24,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=236130.0, ans=0.125 2023-06-18 22:56:31,590 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-18 22:57:57,914 INFO [train.py:996] (0/4) Epoch 2, batch 8900, loss[loss=0.28, simple_loss=0.3534, pruned_loss=0.1033, over 21765.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3545, pruned_loss=0.1156, over 4268641.00 frames. 
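
The scaling.py:962 Whitening entries record a per-module whitening metric next to its configured limit (e.g. metric=17.27 vs. limit=22.5). One plausible definition of such a metric, measuring how far the channel covariance of activations is from a multiple of the identity (1.0 meaning perfectly "white"), is sketched below; this is a guess at the flavor of the computation, not the actual scaling.py code.

```python
# Hypothetical whitening metric: mean squared eigenvalue of the channel
# covariance over the squared mean eigenvalue. Equals 1.0 when all
# eigenvalues match (perfectly white); grows when a few channels dominate.
import torch


def whitening_metric(x, num_groups=1):
    # x: (frames, channels); channels split into groups as in the log lines
    frames, channels = x.shape
    group = channels // num_groups
    metrics = []
    for g in range(num_groups):
        xg = x[:, g * group:(g + 1) * group]
        cov = (xg.T @ xg) / frames
        eigs = torch.linalg.eigvalsh(cov)
        metrics.append((eigs.pow(2).mean() / eigs.mean().pow(2)).item())
    return sum(metrics) / len(metrics)


x = torch.randn(1000, 256)
print(whitening_metric(x))  # modestly above 1.0 for i.i.d. Gaussian channels
```
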
], batch size: 351, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 22:58:04,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236370.0, ans=0.1 2023-06-18 22:59:23,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236550.0, ans=0.1 2023-06-18 23:00:09,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=236610.0, ans=0.125 2023-06-18 23:00:24,374 INFO [train.py:996] (0/4) Epoch 2, batch 8950, loss[loss=0.2467, simple_loss=0.3506, pruned_loss=0.07143, over 20810.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3539, pruned_loss=0.1147, over 4267570.94 frames. ], batch size: 608, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:00:31,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.983e+02 3.644e+02 4.576e+02 8.464e+02, threshold=7.288e+02, percent-clipped=6.0 2023-06-18 23:00:44,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=236730.0, ans=0.2 2023-06-18 23:01:00,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=236730.0, ans=0.125 2023-06-18 23:01:24,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-18 23:01:55,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=236910.0, ans=0.125 2023-06-18 23:02:28,430 INFO [train.py:996] (0/4) Epoch 2, batch 9000, loss[loss=0.2555, simple_loss=0.3236, pruned_loss=0.09371, over 21726.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3503, pruned_loss=0.1147, over 4266801.80 frames. ], batch size: 282, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:02:28,431 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 23:03:33,319 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.9089, 2.6461, 4.5178, 4.6559], device='cuda:0') 2023-06-18 23:03:37,452 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2827, simple_loss=0.3814, pruned_loss=0.09199, over 1796401.00 frames. 2023-06-18 23:03:37,454 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-18 23:03:58,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=237030.0, ans=0.07 2023-06-18 23:04:13,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2023-06-18 23:04:29,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=237090.0, ans=0.125 2023-06-18 23:04:39,541 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:05:34,971 INFO [train.py:996] (0/4) Epoch 2, batch 9050, loss[loss=0.2705, simple_loss=0.3253, pruned_loss=0.1078, over 21547.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3461, pruned_loss=0.1108, over 4256693.86 frames. 
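
At batch 9000 the trainer pauses to compute validation loss over the dev set and then reports peak GPU memory. A generic sketch of that pattern follows; the model's per-batch API is assumed for illustration, not taken from train.py.

```python
# Sketch of the periodic validation pass suggested by the
# "Computing validation loss ... Maximum memory allocated" entries.
import torch


def compute_validation_loss(model, dev_loader, device="cuda:0"):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            # assumed API: forward returns (summed loss, number of frames)
            loss, num_frames = model(batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    print(f"validation: loss={tot_loss / tot_frames:.4f}, "
          f"over {tot_frames:.2f} frames.")
    # analogous to the "Maximum memory allocated so far is ...MB" line:
    print(f"Maximum memory allocated so far is "
          f"{torch.cuda.max_memory_allocated(device) // 2**20}MB")
```
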
], batch size: 230, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:05:36,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.756e+02 3.216e+02 4.039e+02 6.653e+02, threshold=6.431e+02, percent-clipped=0.0 2023-06-18 23:05:41,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=237270.0, ans=0.1 2023-06-18 23:06:00,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=237330.0, ans=0.125 2023-06-18 23:07:21,623 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-18 23:07:43,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=237510.0, ans=0.125 2023-06-18 23:07:48,323 INFO [train.py:996] (0/4) Epoch 2, batch 9100, loss[loss=0.2872, simple_loss=0.3397, pruned_loss=0.1174, over 19912.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.351, pruned_loss=0.1143, over 4255873.65 frames. ], batch size: 703, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:07:50,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=237570.0, ans=0.0 2023-06-18 23:08:10,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=237570.0, ans=0.125 2023-06-18 23:08:20,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=237570.0, ans=0.0 2023-06-18 23:08:20,713 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.18 vs. limit=22.5 2023-06-18 23:08:40,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=237630.0, ans=0.0 2023-06-18 23:08:46,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=237690.0, ans=0.125 2023-06-18 23:09:55,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=237810.0, ans=0.2 2023-06-18 23:10:07,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=237870.0, ans=0.125 2023-06-18 23:10:08,431 INFO [train.py:996] (0/4) Epoch 2, batch 9150, loss[loss=0.3022, simple_loss=0.3754, pruned_loss=0.1145, over 21801.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3556, pruned_loss=0.1113, over 4256416.54 frames. 
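
The zipformer.py:1728 line printed during the validation pass above dumps attn_weights_entropy, a per-head entropy of self-attention weights (low entropy means peaky, concentrated heads). A small sketch of how such a diagnostic can be computed, with hypothetical shapes and names:

```python
# Per-head average entropy of attention weights, in the spirit of the
# "attn_weights_entropy = tensor([...])" validation diagnostic.
import torch


def attn_weights_entropy(attn):
    # attn: (num_heads, tgt_len, src_len); each row is a softmax distribution
    p = attn.clamp(min=1e-20)
    ent = -(p * p.log()).sum(dim=-1)  # (num_heads, tgt_len)
    return ent.mean(dim=-1)           # one entropy value per head


attn = torch.softmax(torch.randn(4, 50, 50), dim=-1)
print(attn_weights_entropy(attn))  # low values => peaky attention heads
```
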
], batch size: 351, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:10:08,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=237870.0, ans=0.125 2023-06-18 23:10:18,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.926e+02 3.668e+02 4.524e+02 1.019e+03, threshold=7.337e+02, percent-clipped=2.0 2023-06-18 23:10:54,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=237930.0, ans=0.2 2023-06-18 23:10:56,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=237930.0, ans=0.125 2023-06-18 23:11:51,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=238050.0, ans=0.015 2023-06-18 23:12:07,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238050.0, ans=0.1 2023-06-18 23:12:10,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=238110.0, ans=0.125 2023-06-18 23:12:13,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=238110.0, ans=10.0 2023-06-18 23:12:24,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=238110.0, ans=0.0 2023-06-18 23:12:29,968 INFO [train.py:996] (0/4) Epoch 2, batch 9200, loss[loss=0.4036, simple_loss=0.4418, pruned_loss=0.1826, over 21486.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3569, pruned_loss=0.1095, over 4267168.45 frames. ], batch size: 471, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:12:31,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238170.0, ans=0.1 2023-06-18 23:12:51,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=238170.0, ans=0.0 2023-06-18 23:13:15,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-06-18 23:14:13,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=238350.0, ans=0.0 2023-06-18 23:14:40,573 INFO [train.py:996] (0/4) Epoch 2, batch 9250, loss[loss=0.2587, simple_loss=0.3062, pruned_loss=0.1055, over 21625.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3605, pruned_loss=0.1147, over 4269809.74 frames. ], batch size: 298, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:14:41,995 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.826e+02 3.250e+02 3.675e+02 5.270e+02, threshold=6.500e+02, percent-clipped=0.0 2023-06-18 23:14:58,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-18 23:14:59,535 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:15:29,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.69 vs. 
limit=22.5 2023-06-18 23:16:21,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=238650.0, ans=0.2 2023-06-18 23:16:21,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238650.0, ans=0.1 2023-06-18 23:17:00,759 INFO [train.py:996] (0/4) Epoch 2, batch 9300, loss[loss=0.2754, simple_loss=0.3451, pruned_loss=0.1029, over 21657.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3538, pruned_loss=0.1155, over 4251211.74 frames. ], batch size: 332, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:17:04,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=238770.0, ans=0.125 2023-06-18 23:17:09,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=238770.0, ans=0.025 2023-06-18 23:17:10,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-18 23:17:11,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=238770.0, ans=0.0 2023-06-18 23:17:19,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=238830.0, ans=0.125 2023-06-18 23:18:22,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=238890.0, ans=0.125 2023-06-18 23:18:41,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=238950.0, ans=0.0 2023-06-18 23:19:01,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=239010.0, ans=0.125 2023-06-18 23:19:09,929 INFO [train.py:996] (0/4) Epoch 2, batch 9350, loss[loss=0.348, simple_loss=0.4003, pruned_loss=0.1479, over 21556.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.363, pruned_loss=0.1181, over 4258179.03 frames. ], batch size: 389, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 23:19:11,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 3.045e+02 3.493e+02 4.279e+02 7.856e+02, threshold=6.986e+02, percent-clipped=1.0 2023-06-18 23:19:11,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=239070.0, ans=0.125 2023-06-18 23:19:22,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.78 vs. 
limit=15.0 2023-06-18 23:19:45,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239130.0, ans=0.1 2023-06-18 23:20:12,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=239190.0, ans=0.2 2023-06-18 23:20:23,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239190.0, ans=0.1 2023-06-18 23:21:25,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=239310.0, ans=0.125 2023-06-18 23:21:29,682 INFO [train.py:996] (0/4) Epoch 2, batch 9400, loss[loss=0.2888, simple_loss=0.3333, pruned_loss=0.1222, over 21567.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3652, pruned_loss=0.1195, over 4252676.98 frames. ], batch size: 414, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:21:41,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-18 23:21:47,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=239370.0, ans=0.125 2023-06-18 23:22:49,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=239490.0, ans=0.0 2023-06-18 23:23:10,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-18 23:23:22,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=239610.0, ans=0.125 2023-06-18 23:23:24,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=239610.0, ans=0.0 2023-06-18 23:23:39,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=239670.0, ans=0.2 2023-06-18 23:23:40,462 INFO [train.py:996] (0/4) Epoch 2, batch 9450, loss[loss=0.2514, simple_loss=0.3033, pruned_loss=0.09974, over 21775.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3563, pruned_loss=0.1172, over 4250131.17 frames. ], batch size: 124, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:23:41,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.730e+02 3.234e+02 3.774e+02 7.408e+02, threshold=6.469e+02, percent-clipped=1.0 2023-06-18 23:24:21,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=239730.0, ans=0.0 2023-06-18 23:24:51,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=12.0 2023-06-18 23:25:15,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=239850.0, ans=0.0 2023-06-18 23:25:45,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=239910.0, ans=0.0 2023-06-18 23:26:03,510 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. 
limit=15.0 2023-06-18 23:26:05,437 INFO [train.py:996] (0/4) Epoch 2, batch 9500, loss[loss=0.319, simple_loss=0.3778, pruned_loss=0.1301, over 21364.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3484, pruned_loss=0.114, over 4253353.92 frames. ], batch size: 131, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:26:11,641 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-40000.pt 2023-06-18 23:26:27,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=240030.0, ans=0.125 2023-06-18 23:27:34,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240150.0, ans=0.1 2023-06-18 23:27:37,492 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=12.0 2023-06-18 23:28:15,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=240210.0, ans=0.5 2023-06-18 23:28:19,625 INFO [train.py:996] (0/4) Epoch 2, batch 9550, loss[loss=0.3135, simple_loss=0.382, pruned_loss=0.1225, over 21732.00 frames. ], tot_loss[loss=0.2938, simple_loss=0.3534, pruned_loss=0.1171, over 4256659.45 frames. ], batch size: 332, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:28:21,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.886e+02 3.595e+02 5.008e+02 8.398e+02, threshold=7.190e+02, percent-clipped=11.0 2023-06-18 23:29:06,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.95 vs. limit=22.5 2023-06-18 23:30:36,140 INFO [train.py:996] (0/4) Epoch 2, batch 9600, loss[loss=0.2905, simple_loss=0.3484, pruned_loss=0.1163, over 21737.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3545, pruned_loss=0.1182, over 4267275.59 frames. ], batch size: 389, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:32:05,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=240750.0, ans=0.0 2023-06-18 23:32:09,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240750.0, ans=0.1 2023-06-18 23:32:11,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=240750.0, ans=0.125 2023-06-18 23:32:45,694 INFO [train.py:996] (0/4) Epoch 2, batch 9650, loss[loss=0.2683, simple_loss=0.3299, pruned_loss=0.1034, over 21838.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3532, pruned_loss=0.1174, over 4275186.30 frames. ], batch size: 247, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:33:00,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.899e+02 3.277e+02 4.400e+02 6.787e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-18 23:34:03,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-18 23:34:05,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=240990.0, ans=0.1 2023-06-18 23:35:17,953 INFO [train.py:996] (0/4) Epoch 2, batch 9700, loss[loss=0.3004, simple_loss=0.3569, pruned_loss=0.122, over 21477.00 frames. 
], tot_loss[loss=0.2957, simple_loss=0.3565, pruned_loss=0.1174, over 4278317.34 frames. ], batch size: 548, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:35:22,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=241170.0, ans=0.125 2023-06-18 23:36:10,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=241290.0, ans=0.0 2023-06-18 23:36:22,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=241350.0, ans=0.125 2023-06-18 23:37:01,870 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-18 23:37:09,757 INFO [train.py:996] (0/4) Epoch 2, batch 9750, loss[loss=0.3238, simple_loss=0.335, pruned_loss=0.1563, over 21358.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3486, pruned_loss=0.1157, over 4277634.21 frames. ], batch size: 508, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:37:11,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.764e+02 3.183e+02 3.672e+02 6.481e+02, threshold=6.367e+02, percent-clipped=0.0 2023-06-18 23:37:20,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=241470.0, ans=0.0 2023-06-18 23:37:29,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=15.0 2023-06-18 23:37:40,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=241530.0, ans=0.125 2023-06-18 23:37:43,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.89 vs. limit=15.0 2023-06-18 23:37:59,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=241590.0, ans=0.0 2023-06-18 23:38:47,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=241710.0, ans=0.0 2023-06-18 23:39:09,322 INFO [train.py:996] (0/4) Epoch 2, batch 9800, loss[loss=0.3403, simple_loss=0.3739, pruned_loss=0.1534, over 21583.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.347, pruned_loss=0.1158, over 4271553.88 frames. ], batch size: 471, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 23:39:18,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.13 vs. limit=6.0 2023-06-18 23:40:40,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=242010.0, ans=0.0 2023-06-18 23:40:54,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242010.0, ans=0.1 2023-06-18 23:41:00,961 INFO [train.py:996] (0/4) Epoch 2, batch 9850, loss[loss=0.2552, simple_loss=0.299, pruned_loss=0.1057, over 20750.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3438, pruned_loss=0.1156, over 4276700.36 frames. 
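
Shortly before batch 9500 the run writes checkpoint-40000.pt into zipformer/exp_L_small, consistent with saving every save_every_n training batches. A sketch of that periodic checkpointing; the helper below is a stand-in, not icefall's checkpoint.py.

```python
# Sketch of periodic checkpointing matching the
# "Saving checkpoint to zipformer/exp_L_small/checkpoint-40000.pt" entry.
from pathlib import Path

import torch


def maybe_save_checkpoint(model, optimizer, batch_idx_train,
                          exp_dir: Path, save_every_n=4000):
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    filename = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        filename,
    )
    print(f"Saving checkpoint to {filename}")
```
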
], batch size: 607, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:41:02,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 2.994e+02 3.598e+02 4.495e+02 7.134e+02, threshold=7.196e+02, percent-clipped=3.0 2023-06-18 23:42:06,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=242190.0, ans=0.125 2023-06-18 23:42:07,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242190.0, ans=0.1 2023-06-18 23:43:09,485 INFO [train.py:996] (0/4) Epoch 2, batch 9900, loss[loss=0.2916, simple_loss=0.3318, pruned_loss=0.1257, over 21256.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3423, pruned_loss=0.116, over 4275229.55 frames. ], batch size: 471, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:44:02,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=242430.0, ans=0.0 2023-06-18 23:44:14,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-18 23:45:24,185 INFO [train.py:996] (0/4) Epoch 2, batch 9950, loss[loss=0.2752, simple_loss=0.3148, pruned_loss=0.1178, over 21650.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3455, pruned_loss=0.119, over 4269471.19 frames. ], batch size: 282, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:45:25,637 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.032e+02 3.636e+02 4.504e+02 8.608e+02, threshold=7.273e+02, percent-clipped=3.0 2023-06-18 23:45:32,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=242670.0, ans=0.125 2023-06-18 23:45:43,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=242730.0, ans=0.035 2023-06-18 23:46:15,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.16 vs. limit=22.5 2023-06-18 23:46:48,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=242850.0, ans=0.125 2023-06-18 23:47:07,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=242910.0, ans=0.2 2023-06-18 23:47:25,780 INFO [train.py:996] (0/4) Epoch 2, batch 10000, loss[loss=0.2411, simple_loss=0.3037, pruned_loss=0.08925, over 21412.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3404, pruned_loss=0.1164, over 4266243.97 frames. ], batch size: 211, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:48:38,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.62 vs. limit=15.0 2023-06-18 23:48:55,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.70 vs. 
limit=15.0 2023-06-18 23:49:30,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=243210.0, ans=0.0 2023-06-18 23:49:31,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=243210.0, ans=0.09899494936611666 2023-06-18 23:49:46,147 INFO [train.py:996] (0/4) Epoch 2, batch 10050, loss[loss=0.2895, simple_loss=0.339, pruned_loss=0.12, over 21591.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3428, pruned_loss=0.1165, over 4266227.81 frames. ], batch size: 441, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:49:47,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.747e+02 3.338e+02 4.174e+02 6.778e+02, threshold=6.677e+02, percent-clipped=0.0 2023-06-18 23:51:50,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=243510.0, ans=0.125 2023-06-18 23:51:58,744 INFO [train.py:996] (0/4) Epoch 2, batch 10100, loss[loss=0.3501, simple_loss=0.4034, pruned_loss=0.1484, over 21478.00 frames. ], tot_loss[loss=0.282, simple_loss=0.338, pruned_loss=0.113, over 4254268.46 frames. ], batch size: 471, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:52:05,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=243570.0, ans=0.0 2023-06-18 23:52:19,921 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:52:46,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.33 vs. limit=15.0 2023-06-18 23:53:16,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=243750.0, ans=0.2 2023-06-18 23:54:11,447 INFO [train.py:996] (0/4) Epoch 2, batch 10150, loss[loss=0.2795, simple_loss=0.337, pruned_loss=0.111, over 21378.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3455, pruned_loss=0.1161, over 4259361.36 frames. ], batch size: 144, lr: 1.75e-02, grad_scale: 64.0 2023-06-18 23:54:12,896 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.705e+02 3.448e+02 4.239e+02 1.060e+03, threshold=6.897e+02, percent-clipped=5.0 2023-06-18 23:55:13,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-18 23:56:12,734 INFO [train.py:996] (0/4) Epoch 2, batch 10200, loss[loss=0.2426, simple_loss=0.3093, pruned_loss=0.08792, over 21749.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3441, pruned_loss=0.1134, over 4251620.55 frames. ], batch size: 112, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:58:30,100 INFO [train.py:996] (0/4) Epoch 2, batch 10250, loss[loss=0.1921, simple_loss=0.2833, pruned_loss=0.05047, over 21792.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3383, pruned_loss=0.1063, over 4253716.87 frames. 
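
Each batch summary carries a grad_scale figure (16.0, 32.0, 64.0 across this stretch): the dynamic loss scale used for fp16 training, which grows when steps succeed and shrinks on overflow. The standard torch.cuda.amp pattern that produces such a scale looks like this; it is a generic example, not the train.py loop itself.

```python
# Generic fp16 training step with dynamic loss scaling; the scaler's current
# scale corresponds to the "grad_scale: 32.0" figures in the log.
# Requires a CUDA device.
import torch

model = torch.nn.Linear(80, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.045)
scaler = torch.cuda.amp.GradScaler()

for step in range(3):
    x = torch.randn(8, 80, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skips the step if gradients overflowed
    scaler.update()          # grows/shrinks the scale; logged as grad_scale
    print(f"grad_scale: {scaler.get_scale()}")
```
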
], batch size: 282, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 23:58:32,920 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 2.447e+02 2.831e+02 3.418e+02 6.746e+02, threshold=5.661e+02, percent-clipped=0.0 2023-06-18 23:58:42,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=244470.0, ans=0.125 2023-06-19 00:00:13,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=22.5 2023-06-19 00:00:27,731 INFO [train.py:996] (0/4) Epoch 2, batch 10300, loss[loss=0.2253, simple_loss=0.3028, pruned_loss=0.07394, over 21877.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3385, pruned_loss=0.1052, over 4262405.56 frames. ], batch size: 107, lr: 1.75e-02, grad_scale: 32.0 2023-06-19 00:00:56,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=244830.0, ans=0.125 2023-06-19 00:01:25,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=244890.0, ans=0.07 2023-06-19 00:01:27,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=244890.0, ans=0.125 2023-06-19 00:02:37,382 INFO [train.py:996] (0/4) Epoch 2, batch 10350, loss[loss=0.2579, simple_loss=0.3252, pruned_loss=0.09532, over 21262.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3418, pruned_loss=0.1068, over 4262583.94 frames. ], batch size: 176, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:02:45,450 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 2.780e+02 3.474e+02 4.368e+02 7.573e+02, threshold=6.948e+02, percent-clipped=10.0 2023-06-19 00:03:08,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=245130.0, ans=0.125 2023-06-19 00:03:40,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=245190.0, ans=0.125 2023-06-19 00:03:52,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-19 00:04:12,140 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:04:24,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=245310.0, ans=0.0 2023-06-19 00:04:49,437 INFO [train.py:996] (0/4) Epoch 2, batch 10400, loss[loss=0.2669, simple_loss=0.3308, pruned_loss=0.1015, over 21705.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3338, pruned_loss=0.1042, over 4262984.21 frames. ], batch size: 391, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:04:56,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.71 vs. 
limit=12.0 2023-06-19 00:05:14,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=245430.0, ans=0.125 2023-06-19 00:05:23,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=245490.0, ans=0.125 2023-06-19 00:05:48,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=245490.0, ans=0.0 2023-06-19 00:05:56,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=245550.0, ans=0.2 2023-06-19 00:06:10,297 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-19 00:06:25,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=245610.0, ans=0.0 2023-06-19 00:06:53,258 INFO [train.py:996] (0/4) Epoch 2, batch 10450, loss[loss=0.2973, simple_loss=0.3559, pruned_loss=0.1194, over 21463.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3388, pruned_loss=0.1093, over 4263918.74 frames. ], batch size: 194, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:07:07,319 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 3.343e+02 4.131e+02 5.392e+02 9.378e+02, threshold=8.262e+02, percent-clipped=3.0 2023-06-19 00:07:12,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=245670.0, ans=0.125 2023-06-19 00:07:30,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=245730.0, ans=0.125 2023-06-19 00:08:33,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245850.0, ans=0.1 2023-06-19 00:09:11,778 INFO [train.py:996] (0/4) Epoch 2, batch 10500, loss[loss=0.2927, simple_loss=0.3495, pruned_loss=0.1179, over 20685.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3375, pruned_loss=0.1081, over 4262375.03 frames. ], batch size: 607, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:10:37,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=246150.0, ans=0.2 2023-06-19 00:10:44,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=246150.0, ans=0.125 2023-06-19 00:11:08,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=246210.0, ans=0.125 2023-06-19 00:11:20,171 INFO [train.py:996] (0/4) Epoch 2, batch 10550, loss[loss=0.2574, simple_loss=0.3119, pruned_loss=0.1015, over 21883.00 frames. ], tot_loss[loss=0.274, simple_loss=0.3324, pruned_loss=0.1078, over 4247907.07 frames. 
], batch size: 373, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:11:23,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.402e+02 2.798e+02 3.186e+02 5.857e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-19 00:12:05,225 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:12:41,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=246450.0, ans=0.125 2023-06-19 00:13:08,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=246510.0, ans=0.0 2023-06-19 00:13:18,269 INFO [train.py:996] (0/4) Epoch 2, batch 10600, loss[loss=0.2393, simple_loss=0.2965, pruned_loss=0.09107, over 15418.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3291, pruned_loss=0.1056, over 4242196.77 frames. ], batch size: 60, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:14:13,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=246630.0, ans=0.125 2023-06-19 00:14:13,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=246630.0, ans=0.1 2023-06-19 00:15:02,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=246750.0, ans=0.125 2023-06-19 00:15:05,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=246750.0, ans=0.015 2023-06-19 00:15:45,042 INFO [train.py:996] (0/4) Epoch 2, batch 10650, loss[loss=0.2035, simple_loss=0.3082, pruned_loss=0.04935, over 20796.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3307, pruned_loss=0.1038, over 4244455.61 frames. ], batch size: 607, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:15:54,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-19 00:15:55,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.840e+02 3.386e+02 4.217e+02 7.399e+02, threshold=6.773e+02, percent-clipped=7.0 2023-06-19 00:16:47,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=246990.0, ans=0.125 2023-06-19 00:16:51,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=246990.0, ans=10.0 2023-06-19 00:17:04,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=247050.0, ans=0.125 2023-06-19 00:17:17,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-19 00:18:08,606 INFO [train.py:996] (0/4) Epoch 2, batch 10700, loss[loss=0.3216, simple_loss=0.3931, pruned_loss=0.125, over 21768.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3305, pruned_loss=0.1042, over 4251017.07 frames. 
], batch size: 124, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:18:50,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=247230.0, ans=0.035 2023-06-19 00:20:29,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=247470.0, ans=0.125 2023-06-19 00:20:30,659 INFO [train.py:996] (0/4) Epoch 2, batch 10750, loss[loss=0.3261, simple_loss=0.4082, pruned_loss=0.1221, over 21754.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3438, pruned_loss=0.1106, over 4255301.36 frames. ], batch size: 351, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:20:32,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=247470.0, ans=0.2 2023-06-19 00:20:33,615 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.858e+02 3.262e+02 4.061e+02 7.112e+02, threshold=6.525e+02, percent-clipped=1.0 2023-06-19 00:20:50,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=247470.0, ans=0.125 2023-06-19 00:22:56,670 INFO [train.py:996] (0/4) Epoch 2, batch 10800, loss[loss=0.3131, simple_loss=0.3728, pruned_loss=0.1267, over 21445.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3511, pruned_loss=0.1123, over 4258481.21 frames. ], batch size: 194, lr: 1.74e-02, grad_scale: 32.0 2023-06-19 00:22:57,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=247770.0, ans=0.2 2023-06-19 00:23:35,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.73 vs. limit=15.0 2023-06-19 00:23:46,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-19 00:23:48,287 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-19 00:23:50,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=247890.0, ans=0.125 2023-06-19 00:24:08,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247890.0, ans=0.1 2023-06-19 00:24:49,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=248010.0, ans=0.125 2023-06-19 00:25:03,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=248010.0, ans=0.125 2023-06-19 00:25:17,897 INFO [train.py:996] (0/4) Epoch 2, batch 10850, loss[loss=0.3317, simple_loss=0.3795, pruned_loss=0.1419, over 20744.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3513, pruned_loss=0.1126, over 4256936.58 frames. 
], batch size: 607, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:25:21,161 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.702e+02 3.119e+02 3.804e+02 6.070e+02, threshold=6.238e+02, percent-clipped=0.0 2023-06-19 00:25:54,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=248130.0, ans=0.2 2023-06-19 00:25:56,280 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:26:11,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=248190.0, ans=0.125 2023-06-19 00:26:41,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=248250.0, ans=0.125 2023-06-19 00:27:07,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=248310.0, ans=0.035 2023-06-19 00:27:21,647 INFO [train.py:996] (0/4) Epoch 2, batch 10900, loss[loss=0.2364, simple_loss=0.2679, pruned_loss=0.1025, over 19926.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3441, pruned_loss=0.1101, over 4252984.56 frames. ], batch size: 702, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:27:52,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-19 00:28:49,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=248550.0, ans=0.125 2023-06-19 00:29:07,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248610.0, ans=0.1 2023-06-19 00:29:12,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=248610.0, ans=0.125 2023-06-19 00:29:18,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=248610.0, ans=0.5 2023-06-19 00:29:25,754 INFO [train.py:996] (0/4) Epoch 2, batch 10950, loss[loss=0.2533, simple_loss=0.3216, pruned_loss=0.09254, over 21744.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3387, pruned_loss=0.107, over 4241484.52 frames. ], batch size: 124, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:29:28,627 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.875e+02 3.449e+02 4.246e+02 8.484e+02, threshold=6.899e+02, percent-clipped=4.0 2023-06-19 00:30:12,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=248730.0, ans=0.2 2023-06-19 00:30:18,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=248730.0, ans=0.125 2023-06-19 00:30:18,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2023-06-19 00:30:33,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=248790.0, ans=0.0 2023-06-19 00:30:35,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. 
limit=15.0 2023-06-19 00:31:18,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-19 00:31:26,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=248910.0, ans=0.0 2023-06-19 00:31:36,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=248910.0, ans=0.0 2023-06-19 00:31:37,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=248910.0, ans=0.1 2023-06-19 00:31:43,334 INFO [train.py:996] (0/4) Epoch 2, batch 11000, loss[loss=0.336, simple_loss=0.3678, pruned_loss=0.1521, over 21750.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3368, pruned_loss=0.1081, over 4248256.26 frames. ], batch size: 508, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:32:10,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=248970.0, ans=0.125 2023-06-19 00:32:55,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=249150.0, ans=0.2 2023-06-19 00:32:58,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=249150.0, ans=0.2 2023-06-19 00:33:47,311 INFO [train.py:996] (0/4) Epoch 2, batch 11050, loss[loss=0.2462, simple_loss=0.284, pruned_loss=0.1042, over 21272.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3357, pruned_loss=0.1097, over 4245390.85 frames. ], batch size: 548, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:33:50,071 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.804e+02 3.313e+02 4.113e+02 7.332e+02, threshold=6.626e+02, percent-clipped=1.0 2023-06-19 00:34:05,998 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=12.0 2023-06-19 00:34:58,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0 2023-06-19 00:35:21,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=249510.0, ans=0.1 2023-06-19 00:35:30,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.50 vs. limit=10.0 2023-06-19 00:35:40,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=249570.0, ans=0.125 2023-06-19 00:35:41,956 INFO [train.py:996] (0/4) Epoch 2, batch 11100, loss[loss=0.2558, simple_loss=0.3138, pruned_loss=0.09887, over 21291.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3332, pruned_loss=0.1095, over 4244285.41 frames. 
], batch size: 194, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:35:44,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=249570.0, ans=0.125 2023-06-19 00:35:51,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=249570.0, ans=0.125 2023-06-19 00:36:01,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=249630.0, ans=0.2 2023-06-19 00:36:08,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=249630.0, ans=0.05 2023-06-19 00:36:11,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=249630.0, ans=0.015 2023-06-19 00:36:20,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=249630.0, ans=0.125 2023-06-19 00:36:37,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=249690.0, ans=0.5 2023-06-19 00:37:45,294 INFO [train.py:996] (0/4) Epoch 2, batch 11150, loss[loss=0.257, simple_loss=0.3115, pruned_loss=0.1012, over 21793.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3318, pruned_loss=0.1091, over 4256627.14 frames. ], batch size: 98, lr: 1.73e-02, grad_scale: 16.0 2023-06-19 00:37:49,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.743e+02 3.116e+02 3.616e+02 5.732e+02, threshold=6.232e+02, percent-clipped=1.0 2023-06-19 00:37:50,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.16 vs. limit=22.5 2023-06-19 00:37:51,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=249870.0, ans=0.125 2023-06-19 00:38:35,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=249930.0, ans=0.125 2023-06-19 00:38:50,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=249990.0, ans=0.125 2023-06-19 00:39:05,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250050.0, ans=0.1 2023-06-19 00:39:06,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-19 00:39:13,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=250050.0, ans=0.0 2023-06-19 00:39:38,501 INFO [train.py:996] (0/4) Epoch 2, batch 11200, loss[loss=0.2484, simple_loss=0.2936, pruned_loss=0.1016, over 21563.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3288, pruned_loss=0.1077, over 4263530.37 frames. 
], batch size: 263, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:39:41,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=250170.0, ans=0.125 2023-06-19 00:39:44,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=250170.0, ans=0.2 2023-06-19 00:40:24,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=250230.0, ans=0.0 2023-06-19 00:41:08,459 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-19 00:41:09,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250350.0, ans=0.1 2023-06-19 00:41:52,958 INFO [train.py:996] (0/4) Epoch 2, batch 11250, loss[loss=0.2624, simple_loss=0.3353, pruned_loss=0.09477, over 21649.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3281, pruned_loss=0.1075, over 4275237.76 frames. ], batch size: 263, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:41:57,397 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.808e+02 3.237e+02 3.673e+02 7.595e+02, threshold=6.473e+02, percent-clipped=2.0 2023-06-19 00:42:15,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=250470.0, ans=0.0 2023-06-19 00:42:17,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=250470.0, ans=0.0 2023-06-19 00:42:40,648 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:42:48,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=250590.0, ans=10.0 2023-06-19 00:43:19,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=250650.0, ans=0.125 2023-06-19 00:44:02,657 INFO [train.py:996] (0/4) Epoch 2, batch 11300, loss[loss=0.2804, simple_loss=0.3386, pruned_loss=0.1111, over 21814.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3299, pruned_loss=0.1074, over 4278638.42 frames. ], batch size: 332, lr: 1.73e-02, grad_scale: 32.0 2023-06-19 00:44:15,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=250770.0, ans=0.125 2023-06-19 00:45:01,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-19 00:45:27,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=250950.0, ans=0.015 2023-06-19 00:45:30,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2023-06-19 00:45:53,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=250950.0, ans=0.2 2023-06-19 00:46:16,926 INFO [train.py:996] (0/4) Epoch 2, batch 11350, loss[loss=0.2205, simple_loss=0.2955, pruned_loss=0.07281, over 21510.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3294, pruned_loss=0.106, over 4268801.95 frames. 
], batch size: 212, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 00:46:23,546 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.745e+02 3.314e+02 3.869e+02 6.937e+02, threshold=6.629e+02, percent-clipped=3.0 2023-06-19 00:47:02,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=251130.0, ans=0.07 2023-06-19 00:47:04,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=251130.0, ans=0.5 2023-06-19 00:48:03,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=251250.0, ans=0.125 2023-06-19 00:48:19,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-06-19 00:48:31,004 INFO [train.py:996] (0/4) Epoch 2, batch 11400, loss[loss=0.2995, simple_loss=0.3553, pruned_loss=0.1218, over 21592.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3364, pruned_loss=0.1094, over 4269113.18 frames. ], batch size: 263, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 00:48:31,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=251370.0, ans=0.0 2023-06-19 00:49:17,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=251430.0, ans=0.125 2023-06-19 00:49:28,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=251490.0, ans=0.1 2023-06-19 00:50:08,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=251550.0, ans=0.0 2023-06-19 00:50:11,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=251550.0, ans=0.1 2023-06-19 00:50:13,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=251550.0, ans=0.125 2023-06-19 00:50:42,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=251610.0, ans=0.125 2023-06-19 00:50:54,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0 2023-06-19 00:50:59,195 INFO [train.py:996] (0/4) Epoch 2, batch 11450, loss[loss=0.3046, simple_loss=0.359, pruned_loss=0.1252, over 21417.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3377, pruned_loss=0.1081, over 4270347.20 frames. ], batch size: 131, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 00:51:17,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.931e+02 3.430e+02 4.239e+02 8.118e+02, threshold=6.860e+02, percent-clipped=4.0 2023-06-19 00:53:22,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=251970.0, ans=0.1 2023-06-19 00:53:23,626 INFO [train.py:996] (0/4) Epoch 2, batch 11500, loss[loss=0.2871, simple_loss=0.3581, pruned_loss=0.1081, over 21334.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3423, pruned_loss=0.1097, over 4271582.30 frames. 
], batch size: 131, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 00:53:36,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-19 00:54:09,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=252030.0, ans=0.0 2023-06-19 00:54:24,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=252090.0, ans=0.125 2023-06-19 00:54:30,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=252090.0, ans=10.0 2023-06-19 00:55:34,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=252210.0, ans=0.0 2023-06-19 00:55:52,491 INFO [train.py:996] (0/4) Epoch 2, batch 11550, loss[loss=0.2939, simple_loss=0.3696, pruned_loss=0.1091, over 21619.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.349, pruned_loss=0.1107, over 4275616.88 frames. ], batch size: 230, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 00:56:10,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.930e+02 3.008e+02 3.655e+02 4.469e+02 7.997e+02, threshold=7.310e+02, percent-clipped=1.0 2023-06-19 00:56:35,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=252330.0, ans=0.125 2023-06-19 00:56:51,344 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0 2023-06-19 00:58:11,602 INFO [train.py:996] (0/4) Epoch 2, batch 11600, loss[loss=0.2644, simple_loss=0.3464, pruned_loss=0.09119, over 21822.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3605, pruned_loss=0.1123, over 4277259.47 frames. ], batch size: 124, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 00:59:03,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=252690.0, ans=0.02 2023-06-19 00:59:48,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.42 vs. limit=22.5 2023-06-19 01:00:26,315 INFO [train.py:996] (0/4) Epoch 2, batch 11650, loss[loss=0.2647, simple_loss=0.3631, pruned_loss=0.08312, over 21458.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3657, pruned_loss=0.1116, over 4278048.62 frames. ], batch size: 194, lr: 1.72e-02, grad_scale: 32.0 2023-06-19 01:00:30,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.719e+02 3.568e+02 4.744e+02 9.384e+02, threshold=7.136e+02, percent-clipped=4.0 2023-06-19 01:00:50,790 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.91 vs. limit=12.0 2023-06-19 01:01:41,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=253050.0, ans=0.0 2023-06-19 01:02:25,804 INFO [train.py:996] (0/4) Epoch 2, batch 11700, loss[loss=0.2762, simple_loss=0.3221, pruned_loss=0.1152, over 21825.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3592, pruned_loss=0.1131, over 4286293.67 frames. 
], batch size: 372, lr: 1.72e-02, grad_scale: 16.0 2023-06-19 01:02:29,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=253170.0, ans=0.0 2023-06-19 01:03:15,219 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2023-06-19 01:03:31,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-06-19 01:03:57,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=253350.0, ans=0.125 2023-06-19 01:04:23,353 INFO [train.py:996] (0/4) Epoch 2, batch 11750, loss[loss=0.2633, simple_loss=0.306, pruned_loss=0.1103, over 21253.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3498, pruned_loss=0.1127, over 4277701.35 frames. ], batch size: 159, lr: 1.72e-02, grad_scale: 16.0 2023-06-19 01:04:42,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 3.061e+02 3.639e+02 4.435e+02 7.294e+02, threshold=7.278e+02, percent-clipped=2.0 2023-06-19 01:05:11,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=253590.0, ans=0.2 2023-06-19 01:05:15,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253590.0, ans=0.1 2023-06-19 01:05:42,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=253590.0, ans=0.0 2023-06-19 01:06:45,928 INFO [train.py:996] (0/4) Epoch 2, batch 11800, loss[loss=0.3126, simple_loss=0.3718, pruned_loss=0.1267, over 21432.00 frames. ], tot_loss[loss=0.2918, simple_loss=0.3525, pruned_loss=0.1156, over 4280990.03 frames. ], batch size: 131, lr: 1.72e-02, grad_scale: 16.0 2023-06-19 01:07:26,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=253830.0, ans=0.0 2023-06-19 01:08:17,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-19 01:08:34,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=254010.0, ans=0.125 2023-06-19 01:08:39,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=254010.0, ans=0.125 2023-06-19 01:08:48,980 INFO [train.py:996] (0/4) Epoch 2, batch 11850, loss[loss=0.277, simple_loss=0.3469, pruned_loss=0.1035, over 21893.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3551, pruned_loss=0.1137, over 4278875.96 frames. 
], batch size: 316, lr: 1.71e-02, grad_scale: 16.0 2023-06-19 01:08:50,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=254070.0, ans=10.0 2023-06-19 01:09:00,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.827e+02 3.352e+02 4.207e+02 5.671e+02, threshold=6.705e+02, percent-clipped=0.0 2023-06-19 01:10:14,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=254190.0, ans=0.125 2023-06-19 01:10:42,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.26 vs. limit=6.0 2023-06-19 01:11:08,828 INFO [train.py:996] (0/4) Epoch 2, batch 11900, loss[loss=0.2579, simple_loss=0.3226, pruned_loss=0.09662, over 21664.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.354, pruned_loss=0.1112, over 4273053.94 frames. ], batch size: 247, lr: 1.71e-02, grad_scale: 16.0 2023-06-19 01:11:20,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=254370.0, ans=0.1 2023-06-19 01:11:26,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=254370.0, ans=0.1 2023-06-19 01:11:29,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=254370.0, ans=0.125 2023-06-19 01:12:05,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.53 vs. limit=6.0 2023-06-19 01:12:13,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=254490.0, ans=0.125 2023-06-19 01:12:50,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=254550.0, ans=0.125 2023-06-19 01:13:30,196 INFO [train.py:996] (0/4) Epoch 2, batch 11950, loss[loss=0.2121, simple_loss=0.2852, pruned_loss=0.06951, over 21257.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3539, pruned_loss=0.1079, over 4270363.06 frames. ], batch size: 176, lr: 1.71e-02, grad_scale: 16.0 2023-06-19 01:13:33,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=254670.0, ans=0.125 2023-06-19 01:13:35,899 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.552e+02 2.960e+02 3.499e+02 5.476e+02, threshold=5.920e+02, percent-clipped=0.0 2023-06-19 01:14:14,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=254730.0, ans=0.125 2023-06-19 01:14:29,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=254790.0, ans=0.125 2023-06-19 01:14:57,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=22.5 2023-06-19 01:15:27,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=254910.0, ans=0.0 2023-06-19 01:15:47,314 INFO [train.py:996] (0/4) Epoch 2, batch 12000, loss[loss=0.2305, simple_loss=0.3037, pruned_loss=0.07863, over 21570.00 frames. 
], tot_loss[loss=0.2805, simple_loss=0.3503, pruned_loss=0.1054, over 4268411.41 frames. ], batch size: 230, lr: 1.71e-02, grad_scale: 32.0 2023-06-19 01:15:47,315 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 01:16:40,669 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2909, simple_loss=0.3809, pruned_loss=0.1004, over 1796401.00 frames. 2023-06-19 01:16:40,671 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-19 01:16:42,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=254970.0, ans=0.0 2023-06-19 01:17:15,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=255090.0, ans=22.5 2023-06-19 01:17:46,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=255090.0, ans=0.05 2023-06-19 01:18:19,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=255210.0, ans=0.05 2023-06-19 01:18:42,660 INFO [train.py:996] (0/4) Epoch 2, batch 12050, loss[loss=0.3057, simple_loss=0.3524, pruned_loss=0.1295, over 21167.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3467, pruned_loss=0.1079, over 4264589.38 frames. ], batch size: 159, lr: 1.71e-02, grad_scale: 32.0 2023-06-19 01:18:57,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.833e+02 3.629e+02 4.945e+02 8.634e+02, threshold=7.258e+02, percent-clipped=13.0 2023-06-19 01:19:03,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=255270.0, ans=0.0 2023-06-19 01:19:35,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=255390.0, ans=0.125 2023-06-19 01:19:38,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=255390.0, ans=0.125 2023-06-19 01:20:07,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=255450.0, ans=0.125 2023-06-19 01:20:21,181 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-06-19 01:20:47,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=255510.0, ans=0.125 2023-06-19 01:20:56,935 INFO [train.py:996] (0/4) Epoch 2, batch 12100, loss[loss=0.3102, simple_loss=0.37, pruned_loss=0.1252, over 21370.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3509, pruned_loss=0.1119, over 4263421.92 frames. 
], batch size: 159, lr: 1.71e-02, grad_scale: 32.0 2023-06-19 01:21:51,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=255630.0, ans=0.0 2023-06-19 01:21:57,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=255690.0, ans=0.125 2023-06-19 01:23:10,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=255810.0, ans=0.0 2023-06-19 01:23:41,034 INFO [train.py:996] (0/4) Epoch 2, batch 12150, loss[loss=0.2968, simple_loss=0.3439, pruned_loss=0.1248, over 20781.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3539, pruned_loss=0.1123, over 4261726.10 frames. ], batch size: 611, lr: 1.71e-02, grad_scale: 32.0 2023-06-19 01:23:50,399 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.161e+02 4.078e+02 5.260e+02 8.280e+02, threshold=8.155e+02, percent-clipped=4.0 2023-06-19 01:23:59,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=255870.0, ans=0.05 2023-06-19 01:25:03,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=256050.0, ans=0.0 2023-06-19 01:25:12,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=256050.0, ans=0.125 2023-06-19 01:25:24,982 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:25:33,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=256110.0, ans=0.0 2023-06-19 01:25:40,320 INFO [train.py:996] (0/4) Epoch 2, batch 12200, loss[loss=0.2907, simple_loss=0.3764, pruned_loss=0.1025, over 21192.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.35, pruned_loss=0.1111, over 4259263.30 frames. ], batch size: 548, lr: 1.71e-02, grad_scale: 32.0 2023-06-19 01:26:17,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=22.5 2023-06-19 01:27:49,616 INFO [train.py:996] (0/4) Epoch 2, batch 12250, loss[loss=0.216, simple_loss=0.2835, pruned_loss=0.07431, over 21755.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3411, pruned_loss=0.1074, over 4255462.39 frames. ], batch size: 124, lr: 1.71e-02, grad_scale: 16.0 2023-06-19 01:28:02,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.815e+02 3.229e+02 3.820e+02 6.594e+02, threshold=6.459e+02, percent-clipped=0.0 2023-06-19 01:28:06,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=256470.0, ans=0.125 2023-06-19 01:28:46,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=256590.0, ans=0.125 2023-06-19 01:29:02,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256590.0, ans=0.1 2023-06-19 01:29:57,179 INFO [train.py:996] (0/4) Epoch 2, batch 12300, loss[loss=0.2061, simple_loss=0.2724, pruned_loss=0.06991, over 21163.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3303, pruned_loss=0.09917, over 4255195.14 frames. 
], batch size: 143, lr: 1.71e-02, grad_scale: 16.0 2023-06-19 01:30:19,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=22.5 2023-06-19 01:31:01,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=256890.0, ans=0.125 2023-06-19 01:31:23,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=256950.0, ans=0.125 2023-06-19 01:31:29,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=256950.0, ans=0.2 2023-06-19 01:32:13,372 INFO [train.py:996] (0/4) Epoch 2, batch 12350, loss[loss=0.2814, simple_loss=0.3425, pruned_loss=0.1101, over 21616.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3356, pruned_loss=0.1016, over 4258716.55 frames. ], batch size: 263, lr: 1.70e-02, grad_scale: 16.0 2023-06-19 01:32:25,737 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.742e+02 3.231e+02 4.296e+02 8.197e+02, threshold=6.463e+02, percent-clipped=4.0 2023-06-19 01:32:59,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=257130.0, ans=0.125 2023-06-19 01:33:02,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-19 01:34:05,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=257310.0, ans=0.125 2023-06-19 01:34:17,934 INFO [train.py:996] (0/4) Epoch 2, batch 12400, loss[loss=0.3268, simple_loss=0.3693, pruned_loss=0.1421, over 21797.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3392, pruned_loss=0.1061, over 4267538.76 frames. ], batch size: 441, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:34:35,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257370.0, ans=0.1 2023-06-19 01:35:17,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-19 01:35:33,173 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-19 01:36:30,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=257550.0, ans=0.125 2023-06-19 01:37:03,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-19 01:37:04,444 INFO [train.py:996] (0/4) Epoch 2, batch 12450, loss[loss=0.3903, simple_loss=0.4229, pruned_loss=0.1789, over 21411.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3443, pruned_loss=0.111, over 4271394.36 frames. 
], batch size: 471, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:37:11,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=257670.0, ans=0.125 2023-06-19 01:37:17,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.991e+02 3.683e+02 4.445e+02 7.854e+02, threshold=7.366e+02, percent-clipped=4.0 2023-06-19 01:37:25,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=257730.0, ans=0.0 2023-06-19 01:37:26,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=257730.0, ans=0.2 2023-06-19 01:37:28,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257730.0, ans=0.1 2023-06-19 01:38:11,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257790.0, ans=0.1 2023-06-19 01:39:11,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=257910.0, ans=0.2 2023-06-19 01:39:14,110 INFO [train.py:996] (0/4) Epoch 2, batch 12500, loss[loss=0.3394, simple_loss=0.4119, pruned_loss=0.1335, over 21275.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3552, pruned_loss=0.1152, over 4275904.75 frames. ], batch size: 176, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:41:08,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=258210.0, ans=0.09899494936611666 2023-06-19 01:41:33,988 INFO [train.py:996] (0/4) Epoch 2, batch 12550, loss[loss=0.2897, simple_loss=0.4016, pruned_loss=0.08892, over 20801.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3619, pruned_loss=0.118, over 4278063.21 frames. ], batch size: 607, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:41:47,024 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.352e+02 3.726e+02 4.299e+02 8.195e+02, threshold=7.451e+02, percent-clipped=1.0 2023-06-19 01:42:55,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=258390.0, ans=0.0 2023-06-19 01:43:52,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=258510.0, ans=0.125 2023-06-19 01:44:05,969 INFO [train.py:996] (0/4) Epoch 2, batch 12600, loss[loss=0.204, simple_loss=0.2791, pruned_loss=0.06441, over 21278.00 frames. ], tot_loss[loss=0.293, simple_loss=0.358, pruned_loss=0.1139, over 4279675.37 frames. ], batch size: 176, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:44:11,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=258570.0, ans=0.04949747468305833 2023-06-19 01:44:23,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=258570.0, ans=0.125 2023-06-19 01:44:49,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=258630.0, ans=0.2 2023-06-19 01:46:09,508 INFO [train.py:996] (0/4) Epoch 2, batch 12650, loss[loss=0.2822, simple_loss=0.3399, pruned_loss=0.1123, over 21494.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3486, pruned_loss=0.1084, over 4282713.99 frames. 
], batch size: 131, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:46:10,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=258870.0, ans=12.0 2023-06-19 01:46:28,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.520e+02 3.258e+02 4.283e+02 8.969e+02, threshold=6.516e+02, percent-clipped=1.0 2023-06-19 01:46:28,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=258870.0, ans=0.125 2023-06-19 01:46:56,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=258990.0, ans=0.1 2023-06-19 01:48:22,471 INFO [train.py:996] (0/4) Epoch 2, batch 12700, loss[loss=0.3583, simple_loss=0.3879, pruned_loss=0.1643, over 21573.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3494, pruned_loss=0.1118, over 4287947.76 frames. ], batch size: 507, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:48:43,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=259170.0, ans=0.125 2023-06-19 01:48:43,969 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-19 01:50:02,231 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.45 vs. limit=10.0 2023-06-19 01:50:11,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=259410.0, ans=0.1 2023-06-19 01:50:33,914 INFO [train.py:996] (0/4) Epoch 2, batch 12750, loss[loss=0.2672, simple_loss=0.3471, pruned_loss=0.09369, over 21696.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3499, pruned_loss=0.1114, over 4282117.67 frames. ], batch size: 298, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:50:59,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.841e+02 3.473e+02 4.427e+02 7.212e+02, threshold=6.945e+02, percent-clipped=3.0 2023-06-19 01:51:12,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=259530.0, ans=0.1 2023-06-19 01:51:55,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=259650.0, ans=0.1 2023-06-19 01:52:59,507 INFO [train.py:996] (0/4) Epoch 2, batch 12800, loss[loss=0.288, simple_loss=0.3499, pruned_loss=0.113, over 21652.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3498, pruned_loss=0.1125, over 4283249.78 frames. ], batch size: 263, lr: 1.70e-02, grad_scale: 32.0 2023-06-19 01:54:08,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=259950.0, ans=0.04949747468305833 2023-06-19 01:54:56,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=260010.0, ans=0.125 2023-06-19 01:54:58,660 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.94 vs. limit=22.5 2023-06-19 01:55:03,239 INFO [train.py:996] (0/4) Epoch 2, batch 12850, loss[loss=0.2692, simple_loss=0.3508, pruned_loss=0.09376, over 21768.00 frames. 
], tot_loss[loss=0.2922, simple_loss=0.3535, pruned_loss=0.1155, over 4285437.17 frames. ], batch size: 351, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 01:55:12,227 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.824e+02 3.270e+02 4.199e+02 6.279e+02, threshold=6.541e+02, percent-clipped=0.0 2023-06-19 01:55:20,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-19 01:56:46,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=260250.0, ans=0.95 2023-06-19 01:57:28,200 INFO [train.py:996] (0/4) Epoch 2, batch 12900, loss[loss=0.212, simple_loss=0.2807, pruned_loss=0.0717, over 21083.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3512, pruned_loss=0.1111, over 4279635.15 frames. ], batch size: 143, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 01:59:49,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.46 vs. limit=22.5 2023-06-19 01:59:54,192 INFO [train.py:996] (0/4) Epoch 2, batch 12950, loss[loss=0.2848, simple_loss=0.3463, pruned_loss=0.1116, over 21809.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3503, pruned_loss=0.1087, over 4280644.21 frames. ], batch size: 282, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:00:01,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.844e+02 3.449e+02 4.156e+02 6.439e+02, threshold=6.898e+02, percent-clipped=0.0 2023-06-19 02:01:30,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=260910.0, ans=0.0 2023-06-19 02:01:39,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=260910.0, ans=0.1 2023-06-19 02:01:54,510 INFO [train.py:996] (0/4) Epoch 2, batch 13000, loss[loss=0.2463, simple_loss=0.3233, pruned_loss=0.08461, over 21751.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3532, pruned_loss=0.11, over 4273001.11 frames. ], batch size: 391, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:03:25,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=261150.0, ans=0.125 2023-06-19 02:04:04,224 INFO [train.py:996] (0/4) Epoch 2, batch 13050, loss[loss=0.3112, simple_loss=0.3697, pruned_loss=0.1264, over 21847.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3463, pruned_loss=0.1068, over 4279761.17 frames. ], batch size: 107, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:04:09,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.45 vs. limit=15.0 2023-06-19 02:04:11,588 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.722e+02 3.295e+02 4.146e+02 8.681e+02, threshold=6.589e+02, percent-clipped=5.0 2023-06-19 02:05:56,108 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.26 vs. limit=6.0 2023-06-19 02:06:09,572 INFO [train.py:996] (0/4) Epoch 2, batch 13100, loss[loss=0.2792, simple_loss=0.3497, pruned_loss=0.1044, over 21776.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3465, pruned_loss=0.1074, over 4278346.88 frames. 
], batch size: 298, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:06:10,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.38 vs. limit=6.0 2023-06-19 02:06:26,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=261570.0, ans=0.2 2023-06-19 02:07:05,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=261630.0, ans=0.125 2023-06-19 02:07:33,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=261690.0, ans=0.125 2023-06-19 02:08:01,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=261750.0, ans=0.125 2023-06-19 02:08:22,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-19 02:08:39,313 INFO [train.py:996] (0/4) Epoch 2, batch 13150, loss[loss=0.2273, simple_loss=0.2971, pruned_loss=0.07879, over 21568.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3489, pruned_loss=0.1112, over 4277074.84 frames. ], batch size: 230, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:08:46,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.844e+02 3.521e+02 4.321e+02 7.421e+02, threshold=7.042e+02, percent-clipped=2.0 2023-06-19 02:08:55,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=261930.0, ans=0.015 2023-06-19 02:10:01,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=261990.0, ans=0.125 2023-06-19 02:10:14,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=262050.0, ans=0.0 2023-06-19 02:10:14,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=262050.0, ans=0.0 2023-06-19 02:10:19,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=262050.0, ans=0.125 2023-06-19 02:10:37,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=262110.0, ans=10.0 2023-06-19 02:10:44,895 INFO [train.py:996] (0/4) Epoch 2, batch 13200, loss[loss=0.3122, simple_loss=0.3685, pruned_loss=0.1279, over 21491.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3486, pruned_loss=0.1115, over 4280855.62 frames. ], batch size: 131, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:11:58,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=262290.0, ans=0.2 2023-06-19 02:11:59,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=262290.0, ans=0.125 2023-06-19 02:12:32,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262410.0, ans=0.1 2023-06-19 02:12:59,890 INFO [train.py:996] (0/4) Epoch 2, batch 13250, loss[loss=0.2831, simple_loss=0.3342, pruned_loss=0.116, over 21281.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3492, pruned_loss=0.1133, over 4274979.87 frames. 
], batch size: 143, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:13:09,761 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.771e+02 3.301e+02 4.204e+02 7.419e+02, threshold=6.603e+02, percent-clipped=1.0 2023-06-19 02:13:53,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=262530.0, ans=0.09899494936611666 2023-06-19 02:14:39,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=262650.0, ans=0.2 2023-06-19 02:15:29,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=262770.0, ans=0.125 2023-06-19 02:15:30,032 INFO [train.py:996] (0/4) Epoch 2, batch 13300, loss[loss=0.3136, simple_loss=0.3716, pruned_loss=0.1278, over 21539.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3511, pruned_loss=0.1122, over 4280040.60 frames. ], batch size: 194, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:15:41,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=262770.0, ans=0.125 2023-06-19 02:16:10,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=262830.0, ans=0.125 2023-06-19 02:16:19,189 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:17:43,400 INFO [train.py:996] (0/4) Epoch 2, batch 13350, loss[loss=0.3239, simple_loss=0.3858, pruned_loss=0.131, over 21595.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3574, pruned_loss=0.1164, over 4285205.69 frames. ], batch size: 389, lr: 1.69e-02, grad_scale: 32.0 2023-06-19 02:18:00,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.775e+02 3.449e+02 3.875e+02 6.020e+02, threshold=6.898e+02, percent-clipped=0.0 2023-06-19 02:18:12,279 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.44 vs. limit=15.0 2023-06-19 02:19:30,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=263310.0, ans=0.125 2023-06-19 02:19:45,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=263310.0, ans=0.2 2023-06-19 02:19:49,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=263310.0, ans=0.125 2023-06-19 02:20:06,316 INFO [train.py:996] (0/4) Epoch 2, batch 13400, loss[loss=0.3041, simple_loss=0.3596, pruned_loss=0.1243, over 21722.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3592, pruned_loss=0.1188, over 4284775.01 frames. 
], batch size: 112, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:21:05,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=263490.0, ans=0.1 2023-06-19 02:21:36,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=263550.0, ans=0.125 2023-06-19 02:22:05,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=263610.0, ans=0.07 2023-06-19 02:22:32,121 INFO [train.py:996] (0/4) Epoch 2, batch 13450, loss[loss=0.2689, simple_loss=0.3238, pruned_loss=0.107, over 21760.00 frames. ], tot_loss[loss=0.3018, simple_loss=0.361, pruned_loss=0.1213, over 4275514.45 frames. ], batch size: 282, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:22:33,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-19 02:22:37,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=263670.0, ans=0.0 2023-06-19 02:22:38,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=263670.0, ans=0.0 2023-06-19 02:22:39,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.392e+02 3.379e+02 3.829e+02 4.469e+02 7.112e+02, threshold=7.658e+02, percent-clipped=1.0 2023-06-19 02:24:10,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=263850.0, ans=0.0 2023-06-19 02:24:21,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=263910.0, ans=0.1 2023-06-19 02:24:24,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=263910.0, ans=0.125 2023-06-19 02:24:37,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=263910.0, ans=0.125 2023-06-19 02:24:40,494 INFO [train.py:996] (0/4) Epoch 2, batch 13500, loss[loss=0.3001, simple_loss=0.3549, pruned_loss=0.1227, over 21720.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3528, pruned_loss=0.1165, over 4273841.48 frames. 
], batch size: 298, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:24:42,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=263970.0, ans=0.2 2023-06-19 02:24:47,077 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-44000.pt 2023-06-19 02:25:14,604 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:25:45,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=264090.0, ans=0.0 2023-06-19 02:26:21,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=264150.0, ans=10.0 2023-06-19 02:26:40,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=264210.0, ans=0.0 2023-06-19 02:26:49,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=264270.0, ans=0.0 2023-06-19 02:26:50,569 INFO [train.py:996] (0/4) Epoch 2, batch 13550, loss[loss=0.3246, simple_loss=0.406, pruned_loss=0.1216, over 21854.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3539, pruned_loss=0.1142, over 4265964.57 frames. ], batch size: 371, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:27:04,664 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.944e+02 3.381e+02 4.408e+02 7.046e+02, threshold=6.762e+02, percent-clipped=0.0 2023-06-19 02:27:53,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=264390.0, ans=0.125 2023-06-19 02:27:53,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=264390.0, ans=0.0 2023-06-19 02:28:35,027 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-19 02:29:05,567 INFO [train.py:996] (0/4) Epoch 2, batch 13600, loss[loss=0.3123, simple_loss=0.3609, pruned_loss=0.1318, over 21883.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3549, pruned_loss=0.115, over 4275373.04 frames. ], batch size: 414, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:29:50,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264630.0, ans=0.1 2023-06-19 02:30:44,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=264750.0, ans=0.125 2023-06-19 02:31:29,056 INFO [train.py:996] (0/4) Epoch 2, batch 13650, loss[loss=0.241, simple_loss=0.2879, pruned_loss=0.09705, over 21258.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3495, pruned_loss=0.1109, over 4271025.26 frames. ], batch size: 548, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:31:36,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.90 vs. 
limit=10.0 2023-06-19 02:31:43,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.727e+02 3.110e+02 3.674e+02 7.098e+02, threshold=6.220e+02, percent-clipped=1.0 2023-06-19 02:32:48,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=264990.0, ans=0.125 2023-06-19 02:33:08,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265050.0, ans=0.1 2023-06-19 02:33:19,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=22.5 2023-06-19 02:33:48,009 INFO [train.py:996] (0/4) Epoch 2, batch 13700, loss[loss=0.2591, simple_loss=0.3181, pruned_loss=0.1001, over 21762.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3444, pruned_loss=0.1111, over 4272539.03 frames. ], batch size: 282, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:35:20,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=265350.0, ans=0.0 2023-06-19 02:35:49,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=22.5 2023-06-19 02:35:52,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265410.0, ans=0.1 2023-06-19 02:35:54,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0 2023-06-19 02:36:07,355 INFO [train.py:996] (0/4) Epoch 2, batch 13750, loss[loss=0.294, simple_loss=0.3665, pruned_loss=0.1108, over 21559.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3398, pruned_loss=0.1094, over 4267580.96 frames. ], batch size: 441, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:36:26,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.182e+02 3.904e+02 5.015e+02 8.772e+02, threshold=7.809e+02, percent-clipped=9.0 2023-06-19 02:36:28,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.52 vs. limit=15.0 2023-06-19 02:37:01,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=265530.0, ans=22.5 2023-06-19 02:37:38,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=265650.0, ans=0.0 2023-06-19 02:38:33,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265710.0, ans=0.1 2023-06-19 02:38:42,203 INFO [train.py:996] (0/4) Epoch 2, batch 13800, loss[loss=0.2994, simple_loss=0.3884, pruned_loss=0.1052, over 21678.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3457, pruned_loss=0.1086, over 4268842.26 frames. 
], batch size: 298, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:39:13,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=265830.0, ans=0.0 2023-06-19 02:39:32,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=265890.0, ans=0.0 2023-06-19 02:39:58,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=265890.0, ans=0.125 2023-06-19 02:40:51,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=266010.0, ans=0.0 2023-06-19 02:40:59,929 INFO [train.py:996] (0/4) Epoch 2, batch 13850, loss[loss=0.2927, simple_loss=0.3558, pruned_loss=0.1148, over 21384.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.351, pruned_loss=0.1099, over 4271960.56 frames. ], batch size: 211, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:41:00,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=266070.0, ans=0.05 2023-06-19 02:41:00,425 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:41:38,738 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.862e+02 3.371e+02 4.001e+02 6.906e+02, threshold=6.742e+02, percent-clipped=0.0 2023-06-19 02:43:32,052 INFO [train.py:996] (0/4) Epoch 2, batch 13900, loss[loss=0.2995, simple_loss=0.3474, pruned_loss=0.1258, over 21783.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.356, pruned_loss=0.1143, over 4271518.69 frames. ], batch size: 298, lr: 1.68e-02, grad_scale: 32.0 2023-06-19 02:43:49,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=266370.0, ans=0.125 2023-06-19 02:45:01,869 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-19 02:45:13,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=266610.0, ans=0.0 2023-06-19 02:45:45,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=266610.0, ans=0.05 2023-06-19 02:45:48,057 INFO [train.py:996] (0/4) Epoch 2, batch 13950, loss[loss=0.2909, simple_loss=0.3442, pruned_loss=0.1188, over 21832.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3554, pruned_loss=0.1157, over 4281272.29 frames. 
], batch size: 298, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:46:01,267 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 3.266e+02 3.802e+02 5.286e+02 1.041e+03, threshold=7.604e+02, percent-clipped=11.0 2023-06-19 02:46:10,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=266670.0, ans=0.125 2023-06-19 02:46:36,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=266730.0, ans=0.0 2023-06-19 02:46:39,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266790.0, ans=0.1 2023-06-19 02:46:54,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=266790.0, ans=0.125 2023-06-19 02:47:42,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=266910.0, ans=0.0 2023-06-19 02:48:10,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=266910.0, ans=0.0 2023-06-19 02:48:21,250 INFO [train.py:996] (0/4) Epoch 2, batch 14000, loss[loss=0.2154, simple_loss=0.2819, pruned_loss=0.07446, over 21723.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3478, pruned_loss=0.1116, over 4260094.60 frames. ], batch size: 264, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:48:39,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=267030.0, ans=0.0 2023-06-19 02:48:42,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=267030.0, ans=0.125 2023-06-19 02:48:43,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=267030.0, ans=0.2 2023-06-19 02:49:15,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=267090.0, ans=0.125 2023-06-19 02:49:38,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2023-06-19 02:50:17,480 INFO [train.py:996] (0/4) Epoch 2, batch 14050, loss[loss=0.2408, simple_loss=0.319, pruned_loss=0.08132, over 21863.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3408, pruned_loss=0.1068, over 4262074.34 frames. ], batch size: 351, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:50:24,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 2.535e+02 3.106e+02 3.697e+02 6.124e+02, threshold=6.211e+02, percent-clipped=0.0 2023-06-19 02:50:59,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=267330.0, ans=0.035 2023-06-19 02:51:24,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=267390.0, ans=0.1 2023-06-19 02:51:24,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=267390.0, ans=0.0 2023-06-19 02:51:28,138 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.56 vs. 
limit=15.0 2023-06-19 02:52:24,988 INFO [train.py:996] (0/4) Epoch 2, batch 14100, loss[loss=0.2369, simple_loss=0.2812, pruned_loss=0.09625, over 20798.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3344, pruned_loss=0.1063, over 4262905.35 frames. ], batch size: 608, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:52:49,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=267630.0, ans=0.5 2023-06-19 02:52:51,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=267630.0, ans=0.125 2023-06-19 02:53:59,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=267810.0, ans=0.125 2023-06-19 02:54:06,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=267810.0, ans=0.125 2023-06-19 02:54:07,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=267810.0, ans=0.2 2023-06-19 02:54:11,523 INFO [train.py:996] (0/4) Epoch 2, batch 14150, loss[loss=0.2627, simple_loss=0.3418, pruned_loss=0.09177, over 21819.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3403, pruned_loss=0.1085, over 4249941.19 frames. ], batch size: 102, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:54:23,830 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.842e+02 3.231e+02 3.958e+02 7.482e+02, threshold=6.462e+02, percent-clipped=1.0 2023-06-19 02:54:51,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=267990.0, ans=0.2 2023-06-19 02:55:25,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-19 02:55:27,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=268050.0, ans=0.0 2023-06-19 02:55:43,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268110.0, ans=0.1 2023-06-19 02:56:01,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-19 02:56:02,058 INFO [train.py:996] (0/4) Epoch 2, batch 14200, loss[loss=0.2598, simple_loss=0.3072, pruned_loss=0.1062, over 21561.00 frames. ], tot_loss[loss=0.275, simple_loss=0.338, pruned_loss=0.106, over 4258143.01 frames. 
], batch size: 263, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:56:25,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=268170.0, ans=0.125 2023-06-19 02:56:36,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=268230.0, ans=0.0 2023-06-19 02:56:43,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=268230.0, ans=0.0 2023-06-19 02:56:43,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=268230.0, ans=0.2 2023-06-19 02:56:46,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=268230.0, ans=0.0 2023-06-19 02:57:04,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=268290.0, ans=0.125 2023-06-19 02:58:11,568 INFO [train.py:996] (0/4) Epoch 2, batch 14250, loss[loss=0.3106, simple_loss=0.4036, pruned_loss=0.1088, over 19701.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.334, pruned_loss=0.1067, over 4262752.86 frames. ], batch size: 703, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 02:58:25,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=268470.0, ans=0.07 2023-06-19 02:58:25,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 2.584e+02 3.251e+02 4.167e+02 7.132e+02, threshold=6.503e+02, percent-clipped=3.0 2023-06-19 02:58:28,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-19 02:59:17,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-19 02:59:18,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268650.0, ans=0.1 2023-06-19 02:59:43,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=268710.0, ans=0.2 2023-06-19 03:00:06,461 INFO [train.py:996] (0/4) Epoch 2, batch 14300, loss[loss=0.3813, simple_loss=0.449, pruned_loss=0.1567, over 21763.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3378, pruned_loss=0.1056, over 4256960.01 frames. ], batch size: 332, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:00:54,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268830.0, ans=0.1 2023-06-19 03:01:04,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=268890.0, ans=0.125 2023-06-19 03:02:16,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=269010.0, ans=0.0 2023-06-19 03:02:34,013 INFO [train.py:996] (0/4) Epoch 2, batch 14350, loss[loss=0.3222, simple_loss=0.3674, pruned_loss=0.1385, over 20008.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3449, pruned_loss=0.1065, over 4258276.57 frames. 
], batch size: 702, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:02:44,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=269070.0, ans=0.95 2023-06-19 03:02:54,125 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.713e+02 3.120e+02 4.327e+02 9.558e+02, threshold=6.239e+02, percent-clipped=7.0 2023-06-19 03:02:55,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.65 vs. limit=22.5 2023-06-19 03:03:06,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=269130.0, ans=0.125 2023-06-19 03:03:31,870 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-19 03:03:53,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-06-19 03:04:08,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=269310.0, ans=0.1 2023-06-19 03:04:23,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=269310.0, ans=0.2 2023-06-19 03:04:36,234 INFO [train.py:996] (0/4) Epoch 2, batch 14400, loss[loss=0.2965, simple_loss=0.3419, pruned_loss=0.1255, over 22021.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3428, pruned_loss=0.1091, over 4262709.82 frames. ], batch size: 103, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:04:36,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=269370.0, ans=0.125 2023-06-19 03:06:32,265 INFO [train.py:996] (0/4) Epoch 2, batch 14450, loss[loss=0.2919, simple_loss=0.3306, pruned_loss=0.1266, over 21589.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3367, pruned_loss=0.1091, over 4262494.55 frames. ], batch size: 441, lr: 1.67e-02, grad_scale: 32.0 2023-06-19 03:06:46,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.837e+02 3.414e+02 4.190e+02 7.999e+02, threshold=6.829e+02, percent-clipped=4.0 2023-06-19 03:07:26,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.04 vs. limit=15.0 2023-06-19 03:08:07,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=269910.0, ans=0.2 2023-06-19 03:08:35,908 INFO [train.py:996] (0/4) Epoch 2, batch 14500, loss[loss=0.3143, simple_loss=0.3882, pruned_loss=0.1202, over 20927.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3335, pruned_loss=0.1086, over 4257536.23 frames. ], batch size: 608, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:09:22,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=270090.0, ans=0.125 2023-06-19 03:09:31,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.25 vs. 
limit=15.0 2023-06-19 03:09:46,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=270150.0, ans=0.95 2023-06-19 03:10:00,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=270210.0, ans=0.0 2023-06-19 03:10:34,023 INFO [train.py:996] (0/4) Epoch 2, batch 14550, loss[loss=0.2421, simple_loss=0.2923, pruned_loss=0.09598, over 20688.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3379, pruned_loss=0.1099, over 4253393.32 frames. ], batch size: 607, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:10:55,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.690e+02 3.116e+02 3.745e+02 6.340e+02, threshold=6.231e+02, percent-clipped=0.0 2023-06-19 03:12:49,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=270510.0, ans=0.125 2023-06-19 03:13:00,814 INFO [train.py:996] (0/4) Epoch 2, batch 14600, loss[loss=0.3133, simple_loss=0.3814, pruned_loss=0.1226, over 21783.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3485, pruned_loss=0.1165, over 4262819.28 frames. ], batch size: 124, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:13:19,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=270630.0, ans=0.0 2023-06-19 03:14:04,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=270690.0, ans=0.0 2023-06-19 03:14:13,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=270690.0, ans=0.05 2023-06-19 03:14:27,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=270750.0, ans=0.2 2023-06-19 03:15:08,360 INFO [train.py:996] (0/4) Epoch 2, batch 14650, loss[loss=0.2045, simple_loss=0.2888, pruned_loss=0.06011, over 21623.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3509, pruned_loss=0.1146, over 4268856.97 frames. ], batch size: 263, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:15:28,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.832e+02 3.402e+02 3.897e+02 6.395e+02, threshold=6.804e+02, percent-clipped=2.0 2023-06-19 03:16:49,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=271050.0, ans=0.0 2023-06-19 03:17:14,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=271110.0, ans=0.0 2023-06-19 03:17:32,068 INFO [train.py:996] (0/4) Epoch 2, batch 14700, loss[loss=0.2713, simple_loss=0.3467, pruned_loss=0.09793, over 21649.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.342, pruned_loss=0.1072, over 4256958.90 frames. 
], batch size: 263, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:18:09,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=271230.0, ans=0.0 2023-06-19 03:18:26,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=271230.0, ans=0.07 2023-06-19 03:18:56,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=271350.0, ans=0.04949747468305833 2023-06-19 03:19:41,506 INFO [train.py:996] (0/4) Epoch 2, batch 14750, loss[loss=0.2836, simple_loss=0.328, pruned_loss=0.1196, over 21192.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3489, pruned_loss=0.1101, over 4264429.54 frames. ], batch size: 608, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:20:21,834 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 2.730e+02 3.490e+02 4.322e+02 1.005e+03, threshold=6.981e+02, percent-clipped=5.0 2023-06-19 03:20:26,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=271530.0, ans=0.0 2023-06-19 03:20:56,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271590.0, ans=0.1 2023-06-19 03:21:43,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=271710.0, ans=0.125 2023-06-19 03:21:56,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271710.0, ans=0.1 2023-06-19 03:22:05,206 INFO [train.py:996] (0/4) Epoch 2, batch 14800, loss[loss=0.3588, simple_loss=0.4043, pruned_loss=0.1567, over 21554.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3605, pruned_loss=0.1162, over 4264412.60 frames. ], batch size: 414, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:22:14,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=271770.0, ans=0.1 2023-06-19 03:22:59,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-19 03:24:07,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272010.0, ans=0.1 2023-06-19 03:24:36,334 INFO [train.py:996] (0/4) Epoch 2, batch 14850, loss[loss=0.3307, simple_loss=0.3886, pruned_loss=0.1364, over 21623.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3532, pruned_loss=0.1159, over 4266605.69 frames. ], batch size: 389, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:24:45,640 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.062e+02 3.650e+02 4.413e+02 7.562e+02, threshold=7.301e+02, percent-clipped=4.0 2023-06-19 03:25:47,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=272190.0, ans=0.125 2023-06-19 03:26:49,816 INFO [train.py:996] (0/4) Epoch 2, batch 14900, loss[loss=0.3008, simple_loss=0.3569, pruned_loss=0.1224, over 21978.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3562, pruned_loss=0.1182, over 4269404.79 frames. 
], batch size: 317, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:27:04,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=272370.0, ans=0.2 2023-06-19 03:27:30,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=272430.0, ans=0.0 2023-06-19 03:27:53,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=272490.0, ans=0.125 2023-06-19 03:28:15,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=272550.0, ans=0.2 2023-06-19 03:28:23,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=272550.0, ans=0.0 2023-06-19 03:28:34,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=12.0 2023-06-19 03:28:53,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.28 vs. limit=10.0 2023-06-19 03:29:10,898 INFO [train.py:996] (0/4) Epoch 2, batch 14950, loss[loss=0.2917, simple_loss=0.3591, pruned_loss=0.1121, over 21966.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3556, pruned_loss=0.1163, over 4274103.22 frames. ], batch size: 317, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:29:25,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.792e+02 3.355e+02 4.143e+02 6.575e+02, threshold=6.711e+02, percent-clipped=0.0 2023-06-19 03:30:29,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=272790.0, ans=0.125 2023-06-19 03:30:40,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272850.0, ans=0.1 2023-06-19 03:31:17,960 INFO [train.py:996] (0/4) Epoch 2, batch 15000, loss[loss=0.2958, simple_loss=0.3754, pruned_loss=0.1081, over 20757.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3595, pruned_loss=0.1192, over 4276339.52 frames. ], batch size: 607, lr: 1.66e-02, grad_scale: 32.0 2023-06-19 03:31:17,961 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 03:32:09,028 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.272, simple_loss=0.3679, pruned_loss=0.08803, over 1796401.00 frames. 
2023-06-19 03:32:09,029 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-19 03:32:46,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=273090.0, ans=10.0 2023-06-19 03:33:07,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=273150.0, ans=0.125 2023-06-19 03:33:31,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=273150.0, ans=0.0 2023-06-19 03:33:47,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=273210.0, ans=0.0 2023-06-19 03:34:02,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=273210.0, ans=0.015 2023-06-19 03:34:11,071 INFO [train.py:996] (0/4) Epoch 2, batch 15050, loss[loss=0.4127, simple_loss=0.477, pruned_loss=0.1742, over 20863.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3614, pruned_loss=0.1201, over 4275124.40 frames. ], batch size: 607, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:34:28,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.105e+02 3.767e+02 5.029e+02 7.583e+02, threshold=7.535e+02, percent-clipped=4.0 2023-06-19 03:34:35,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=273330.0, ans=0.02 2023-06-19 03:34:35,993 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.25 vs. limit=15.0 2023-06-19 03:34:56,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=273330.0, ans=0.1 2023-06-19 03:36:06,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=15.0 2023-06-19 03:36:22,093 INFO [train.py:996] (0/4) Epoch 2, batch 15100, loss[loss=0.351, simple_loss=0.4413, pruned_loss=0.1304, over 19751.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3634, pruned_loss=0.119, over 4267000.35 frames. ], batch size: 702, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:37:37,291 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:38:16,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=273810.0, ans=0.025 2023-06-19 03:38:41,666 INFO [train.py:996] (0/4) Epoch 2, batch 15150, loss[loss=0.3203, simple_loss=0.3473, pruned_loss=0.1467, over 21351.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3574, pruned_loss=0.1183, over 4266971.47 frames. 
], batch size: 473, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:38:43,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=273870.0, ans=0.125 2023-06-19 03:38:56,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.943e+02 3.299e+02 3.908e+02 6.468e+02, threshold=6.598e+02, percent-clipped=0.0 2023-06-19 03:39:01,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=273870.0, ans=0.125 2023-06-19 03:40:09,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=274050.0, ans=0.125 2023-06-19 03:40:10,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=274050.0, ans=0.2 2023-06-19 03:40:23,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=274110.0, ans=15.0 2023-06-19 03:40:36,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=274170.0, ans=0.125 2023-06-19 03:40:36,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=274170.0, ans=0.2 2023-06-19 03:40:37,432 INFO [train.py:996] (0/4) Epoch 2, batch 15200, loss[loss=0.254, simple_loss=0.3382, pruned_loss=0.08486, over 21581.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.347, pruned_loss=0.1128, over 4261319.49 frames. ], batch size: 389, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:40:38,253 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:41:11,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=274290.0, ans=0.2 2023-06-19 03:41:28,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=274290.0, ans=0.125 2023-06-19 03:42:21,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=274410.0, ans=0.125 2023-06-19 03:42:28,463 INFO [train.py:996] (0/4) Epoch 2, batch 15250, loss[loss=0.3205, simple_loss=0.3557, pruned_loss=0.1427, over 21589.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3416, pruned_loss=0.1115, over 4254638.24 frames. ], batch size: 441, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:42:42,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.617e+02 3.072e+02 3.337e+02 6.038e+02, threshold=6.144e+02, percent-clipped=0.0 2023-06-19 03:44:11,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=274650.0, ans=0.125 2023-06-19 03:44:17,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=274710.0, ans=0.0 2023-06-19 03:44:21,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=12.0 2023-06-19 03:44:37,230 INFO [train.py:996] (0/4) Epoch 2, batch 15300, loss[loss=0.3119, simple_loss=0.3716, pruned_loss=0.1261, over 21641.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3465, pruned_loss=0.1158, over 4254817.28 frames. 
], batch size: 113, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:45:27,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=274830.0, ans=0.2 2023-06-19 03:46:09,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0 2023-06-19 03:46:23,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=275010.0, ans=0.2 2023-06-19 03:46:38,760 INFO [train.py:996] (0/4) Epoch 2, batch 15350, loss[loss=0.2749, simple_loss=0.3501, pruned_loss=0.09989, over 21784.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3538, pruned_loss=0.1184, over 4260808.71 frames. ], batch size: 282, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:46:52,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=275070.0, ans=0.0 2023-06-19 03:47:11,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 3.136e+02 3.689e+02 4.822e+02 8.057e+02, threshold=7.379e+02, percent-clipped=7.0 2023-06-19 03:47:17,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.93 vs. limit=10.0 2023-06-19 03:48:01,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=275190.0, ans=0.125 2023-06-19 03:48:01,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=275190.0, ans=0.125 2023-06-19 03:48:52,822 INFO [train.py:996] (0/4) Epoch 2, batch 15400, loss[loss=0.2932, simple_loss=0.3488, pruned_loss=0.1187, over 21821.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3562, pruned_loss=0.1167, over 4264218.57 frames. ], batch size: 414, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:49:59,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=275490.0, ans=0.125 2023-06-19 03:50:36,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=275610.0, ans=0.0 2023-06-19 03:50:42,046 INFO [train.py:996] (0/4) Epoch 2, batch 15450, loss[loss=0.2747, simple_loss=0.3364, pruned_loss=0.1065, over 21851.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3525, pruned_loss=0.1146, over 4271032.67 frames. ], batch size: 107, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:51:14,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.540e+02 3.062e+02 3.855e+02 5.645e+02, threshold=6.124e+02, percent-clipped=0.0 2023-06-19 03:51:18,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275670.0, ans=0.1 2023-06-19 03:51:46,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=275730.0, ans=0.125 2023-06-19 03:51:53,094 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.99 vs. 
limit=10.0 2023-06-19 03:52:13,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=275790.0, ans=0.125 2023-06-19 03:53:13,390 INFO [train.py:996] (0/4) Epoch 2, batch 15500, loss[loss=0.416, simple_loss=0.4374, pruned_loss=0.1973, over 21354.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3551, pruned_loss=0.1153, over 4268070.80 frames. ], batch size: 507, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:53:19,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275970.0, ans=0.1 2023-06-19 03:54:41,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=276150.0, ans=0.125 2023-06-19 03:55:28,551 INFO [train.py:996] (0/4) Epoch 2, batch 15550, loss[loss=0.2364, simple_loss=0.3051, pruned_loss=0.0839, over 21343.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3506, pruned_loss=0.1125, over 4261266.14 frames. ], batch size: 194, lr: 1.65e-02, grad_scale: 32.0 2023-06-19 03:55:49,738 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.861e+02 3.609e+02 4.664e+02 1.166e+03, threshold=7.218e+02, percent-clipped=12.0 2023-06-19 03:56:09,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=276330.0, ans=0.125 2023-06-19 03:57:30,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=276510.0, ans=0.125 2023-06-19 03:57:36,374 INFO [train.py:996] (0/4) Epoch 2, batch 15600, loss[loss=0.2649, simple_loss=0.3335, pruned_loss=0.09811, over 21601.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3435, pruned_loss=0.1109, over 4261610.31 frames. ], batch size: 263, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 03:58:05,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=276570.0, ans=0.125 2023-06-19 03:58:06,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=276570.0, ans=0.0 2023-06-19 03:58:47,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=276690.0, ans=0.125 2023-06-19 03:58:55,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=276750.0, ans=0.2 2023-06-19 03:59:20,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-19 03:59:32,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=276810.0, ans=0.05 2023-06-19 03:59:36,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=276810.0, ans=0.09899494936611666 2023-06-19 03:59:39,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=276810.0, ans=0.0 2023-06-19 04:00:05,933 INFO [train.py:996] (0/4) Epoch 2, batch 15650, loss[loss=0.2588, simple_loss=0.3191, pruned_loss=0.09921, over 21629.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.3413, pruned_loss=0.1104, over 4257410.95 frames. 
], batch size: 332, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:00:14,621 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.975e+02 3.392e+02 4.478e+02 7.977e+02, threshold=6.785e+02, percent-clipped=4.0 2023-06-19 04:00:56,350 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:02:03,676 INFO [train.py:996] (0/4) Epoch 2, batch 15700, loss[loss=0.2623, simple_loss=0.3181, pruned_loss=0.1033, over 21822.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3362, pruned_loss=0.1083, over 4262493.89 frames. ], batch size: 352, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:02:55,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277230.0, ans=0.1 2023-06-19 04:03:10,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=15.0 2023-06-19 04:03:14,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=277290.0, ans=0.125 2023-06-19 04:03:32,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=277350.0, ans=0.0 2023-06-19 04:04:04,417 INFO [train.py:996] (0/4) Epoch 2, batch 15750, loss[loss=0.2562, simple_loss=0.3092, pruned_loss=0.1016, over 21539.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3306, pruned_loss=0.1069, over 4258165.64 frames. ], batch size: 263, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:04:20,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277470.0, ans=0.1 2023-06-19 04:04:22,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.529e+02 2.886e+02 3.398e+02 5.891e+02, threshold=5.773e+02, percent-clipped=0.0 2023-06-19 04:04:35,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277530.0, ans=0.1 2023-06-19 04:06:11,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-06-19 04:06:19,594 INFO [train.py:996] (0/4) Epoch 2, batch 15800, loss[loss=0.2411, simple_loss=0.2942, pruned_loss=0.094, over 21674.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3268, pruned_loss=0.1065, over 4261656.42 frames. ], batch size: 298, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:06:43,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=22.5 2023-06-19 04:06:46,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=277830.0, ans=0.0 2023-06-19 04:07:27,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=277890.0, ans=0.125 2023-06-19 04:07:51,136 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-19 04:08:40,430 INFO [train.py:996] (0/4) Epoch 2, batch 15850, loss[loss=0.3269, simple_loss=0.3755, pruned_loss=0.1392, over 21570.00 frames. 
], tot_loss[loss=0.2748, simple_loss=0.3304, pruned_loss=0.1096, over 4258691.50 frames. ], batch size: 415, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:08:41,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-19 04:08:45,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=278070.0, ans=0.125 2023-06-19 04:08:49,298 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.859e+02 3.359e+02 4.336e+02 6.556e+02, threshold=6.719e+02, percent-clipped=7.0 2023-06-19 04:09:41,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-19 04:10:16,042 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:10:23,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=278310.0, ans=0.1 2023-06-19 04:10:38,061 INFO [train.py:996] (0/4) Epoch 2, batch 15900, loss[loss=0.2685, simple_loss=0.3496, pruned_loss=0.09374, over 21665.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3323, pruned_loss=0.1106, over 4258794.13 frames. ], batch size: 298, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:12:36,282 INFO [train.py:996] (0/4) Epoch 2, batch 15950, loss[loss=0.2565, simple_loss=0.3245, pruned_loss=0.09424, over 21463.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3337, pruned_loss=0.1086, over 4246282.74 frames. ], batch size: 131, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:13:02,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.565e+02 3.122e+02 3.960e+02 8.698e+02, threshold=6.245e+02, percent-clipped=1.0 2023-06-19 04:14:08,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=278850.0, ans=10.0 2023-06-19 04:14:43,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=278910.0, ans=0.125 2023-06-19 04:14:44,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=278910.0, ans=0.1 2023-06-19 04:15:07,375 INFO [train.py:996] (0/4) Epoch 2, batch 16000, loss[loss=0.2591, simple_loss=0.3438, pruned_loss=0.0872, over 21852.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3344, pruned_loss=0.1059, over 4249454.77 frames. ], batch size: 371, lr: 1.64e-02, grad_scale: 32.0 2023-06-19 04:15:39,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279030.0, ans=0.1 2023-06-19 04:16:34,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=22.5 2023-06-19 04:17:13,456 INFO [train.py:996] (0/4) Epoch 2, batch 16050, loss[loss=0.2375, simple_loss=0.3223, pruned_loss=0.07634, over 21617.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3363, pruned_loss=0.1037, over 4260180.88 frames. 
], batch size: 263, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:17:30,615 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 2.605e+02 3.348e+02 4.219e+02 7.021e+02, threshold=6.696e+02, percent-clipped=3.0 2023-06-19 04:17:33,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-19 04:17:36,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2023-06-19 04:18:11,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=279390.0, ans=0.07 2023-06-19 04:19:03,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=279510.0, ans=0.125 2023-06-19 04:19:06,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=279510.0, ans=0.125 2023-06-19 04:19:19,763 INFO [train.py:996] (0/4) Epoch 2, batch 16100, loss[loss=0.3022, simple_loss=0.3564, pruned_loss=0.124, over 21734.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3378, pruned_loss=0.1043, over 4261524.27 frames. ], batch size: 389, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:19:23,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=279570.0, ans=0.0 2023-06-19 04:19:26,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=279570.0, ans=0.125 2023-06-19 04:19:38,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=279630.0, ans=0.125 2023-06-19 04:20:27,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=279750.0, ans=0.0 2023-06-19 04:20:29,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279750.0, ans=0.1 2023-06-19 04:21:27,814 INFO [train.py:996] (0/4) Epoch 2, batch 16150, loss[loss=0.2564, simple_loss=0.3178, pruned_loss=0.09755, over 21438.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3376, pruned_loss=0.1074, over 4276317.53 frames. ], batch size: 211, lr: 1.64e-02, grad_scale: 16.0 2023-06-19 04:21:48,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.31 vs. 
limit=15.0 2023-06-19 04:21:57,623 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.251e+02 4.179e+02 4.916e+02 8.388e+02, threshold=8.358e+02, percent-clipped=5.0 2023-06-19 04:22:08,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=279930.0, ans=15.0 2023-06-19 04:22:41,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=280050.0, ans=0.0 2023-06-19 04:22:44,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280050.0, ans=0.1 2023-06-19 04:23:18,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280110.0, ans=0.1 2023-06-19 04:23:36,111 INFO [train.py:996] (0/4) Epoch 2, batch 16200, loss[loss=0.3744, simple_loss=0.4189, pruned_loss=0.165, over 21439.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3421, pruned_loss=0.1094, over 4274943.44 frames. ], batch size: 471, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:24:57,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280290.0, ans=0.1 2023-06-19 04:25:17,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280350.0, ans=0.1 2023-06-19 04:25:17,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=280350.0, ans=10.0 2023-06-19 04:25:56,552 INFO [train.py:996] (0/4) Epoch 2, batch 16250, loss[loss=0.1853, simple_loss=0.2542, pruned_loss=0.0582, over 21744.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3414, pruned_loss=0.1088, over 4274465.39 frames. ], batch size: 112, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:26:14,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.660e+02 3.048e+02 3.476e+02 5.105e+02, threshold=6.097e+02, percent-clipped=0.0 2023-06-19 04:26:47,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=280590.0, ans=10.0 2023-06-19 04:27:18,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=280710.0, ans=0.125 2023-06-19 04:27:46,216 INFO [train.py:996] (0/4) Epoch 2, batch 16300, loss[loss=0.2204, simple_loss=0.2856, pruned_loss=0.07766, over 21776.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3338, pruned_loss=0.1032, over 4275197.99 frames. ], batch size: 112, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:28:06,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=280770.0, ans=0.125 2023-06-19 04:28:52,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=280890.0, ans=0.2 2023-06-19 04:29:03,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=280950.0, ans=0.2 2023-06-19 04:29:21,041 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.28 vs. 
limit=6.0 2023-06-19 04:29:47,571 INFO [train.py:996] (0/4) Epoch 2, batch 16350, loss[loss=0.3592, simple_loss=0.4225, pruned_loss=0.1479, over 21855.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3371, pruned_loss=0.105, over 4272801.55 frames. ], batch size: 124, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:29:47,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=281070.0, ans=0.0 2023-06-19 04:30:23,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.593e+02 3.131e+02 3.921e+02 6.968e+02, threshold=6.263e+02, percent-clipped=2.0 2023-06-19 04:30:24,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-19 04:31:03,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=281190.0, ans=0.125 2023-06-19 04:31:53,151 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:32:19,369 INFO [train.py:996] (0/4) Epoch 2, batch 16400, loss[loss=0.2866, simple_loss=0.3419, pruned_loss=0.1157, over 21406.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3424, pruned_loss=0.1072, over 4278002.90 frames. ], batch size: 144, lr: 1.63e-02, grad_scale: 32.0 2023-06-19 04:33:24,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=281550.0, ans=0.125 2023-06-19 04:33:41,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=281550.0, ans=0.07 2023-06-19 04:34:12,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281610.0, ans=0.1 2023-06-19 04:34:38,584 INFO [train.py:996] (0/4) Epoch 2, batch 16450, loss[loss=0.2573, simple_loss=0.3073, pruned_loss=0.1037, over 21560.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3415, pruned_loss=0.1085, over 4286437.97 frames. ], batch size: 212, lr: 1.63e-02, grad_scale: 32.0 2023-06-19 04:34:52,040 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:34:55,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=281670.0, ans=0.0 2023-06-19 04:34:56,412 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.650e+02 3.270e+02 4.334e+02 6.545e+02, threshold=6.541e+02, percent-clipped=2.0 2023-06-19 04:34:58,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=281730.0, ans=0.125 2023-06-19 04:35:34,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=281850.0, ans=0.125 2023-06-19 04:36:41,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=281910.0, ans=0.0 2023-06-19 04:36:49,343 INFO [train.py:996] (0/4) Epoch 2, batch 16500, loss[loss=0.2017, simple_loss=0.2494, pruned_loss=0.07699, over 21206.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3425, pruned_loss=0.1093, over 4285876.50 frames. 
], batch size: 159, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:38:13,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=282090.0, ans=0.0 2023-06-19 04:38:15,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=282090.0, ans=0.125 2023-06-19 04:39:00,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-19 04:39:25,955 INFO [train.py:996] (0/4) Epoch 2, batch 16550, loss[loss=0.2591, simple_loss=0.3316, pruned_loss=0.09336, over 21807.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3359, pruned_loss=0.1042, over 4281375.15 frames. ], batch size: 282, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:39:38,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.903e+02 3.447e+02 4.239e+02 9.534e+02, threshold=6.894e+02, percent-clipped=2.0 2023-06-19 04:39:42,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=282330.0, ans=0.2 2023-06-19 04:39:59,951 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:40:31,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=282390.0, ans=0.125 2023-06-19 04:40:32,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=282390.0, ans=0.125 2023-06-19 04:40:34,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=282390.0, ans=0.0 2023-06-19 04:41:42,893 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.21 vs. limit=10.0 2023-06-19 04:41:46,296 INFO [train.py:996] (0/4) Epoch 2, batch 16600, loss[loss=0.3144, simple_loss=0.3831, pruned_loss=0.1228, over 21824.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3455, pruned_loss=0.1084, over 4277530.67 frames. ], batch size: 118, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:41:52,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=282570.0, ans=0.0 2023-06-19 04:41:52,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=282570.0, ans=0.125 2023-06-19 04:41:53,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-06-19 04:42:00,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=12.0 2023-06-19 04:42:08,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.90 vs. 
limit=12.0 2023-06-19 04:42:49,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=282690.0, ans=0.0 2023-06-19 04:42:56,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=282690.0, ans=0.125 2023-06-19 04:44:14,849 INFO [train.py:996] (0/4) Epoch 2, batch 16650, loss[loss=0.3162, simple_loss=0.3786, pruned_loss=0.1269, over 21996.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.355, pruned_loss=0.1119, over 4272638.04 frames. ], batch size: 317, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:44:21,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=282870.0, ans=0.0 2023-06-19 04:44:28,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.764e+02 3.302e+02 3.864e+02 7.360e+02, threshold=6.604e+02, percent-clipped=1.0 2023-06-19 04:45:17,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=282990.0, ans=22.5 2023-06-19 04:45:28,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=283050.0, ans=0.125 2023-06-19 04:45:52,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=283050.0, ans=0.125 2023-06-19 04:46:21,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=283110.0, ans=0.125 2023-06-19 04:46:28,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=283110.0, ans=0.125 2023-06-19 04:46:32,966 INFO [train.py:996] (0/4) Epoch 2, batch 16700, loss[loss=0.2112, simple_loss=0.2681, pruned_loss=0.07714, over 21885.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3547, pruned_loss=0.1121, over 4269770.85 frames. ], batch size: 98, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:46:42,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=283170.0, ans=0.1 2023-06-19 04:49:05,852 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-19 04:49:06,297 INFO [train.py:996] (0/4) Epoch 2, batch 16750, loss[loss=0.3519, simple_loss=0.4162, pruned_loss=0.1438, over 21594.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3611, pruned_loss=0.1158, over 4263255.67 frames. ], batch size: 414, lr: 1.63e-02, grad_scale: 16.0 2023-06-19 04:49:32,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.141e+02 3.831e+02 4.600e+02 7.842e+02, threshold=7.663e+02, percent-clipped=1.0 2023-06-19 04:50:07,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=283530.0, ans=0.125 2023-06-19 04:50:21,185 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.90 vs. 
limit=15.0 2023-06-19 04:50:27,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=283590.0, ans=0.0 2023-06-19 04:51:06,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=283650.0, ans=0.125 2023-06-19 04:51:23,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=283710.0, ans=0.125 2023-06-19 04:51:37,206 INFO [train.py:996] (0/4) Epoch 2, batch 16800, loss[loss=0.2822, simple_loss=0.3405, pruned_loss=0.1119, over 21966.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3629, pruned_loss=0.1158, over 4253154.44 frames. ], batch size: 113, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:52:22,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.96 vs. limit=15.0 2023-06-19 04:52:41,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=283890.0, ans=0.0 2023-06-19 04:52:48,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=283890.0, ans=0.09899494936611666 2023-06-19 04:52:51,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=283890.0, ans=0.125 2023-06-19 04:52:55,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=15.0 2023-06-19 04:53:54,893 INFO [train.py:996] (0/4) Epoch 2, batch 16850, loss[loss=0.2902, simple_loss=0.3415, pruned_loss=0.1195, over 21350.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3587, pruned_loss=0.1164, over 4264701.45 frames. ], batch size: 159, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:54:16,972 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 3.232e+02 3.798e+02 4.468e+02 9.826e+02, threshold=7.596e+02, percent-clipped=4.0 2023-06-19 04:54:59,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=284190.0, ans=0.125 2023-06-19 04:55:00,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=284190.0, ans=0.2 2023-06-19 04:55:10,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=284190.0, ans=0.125 2023-06-19 04:55:12,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=284190.0, ans=0.04949747468305833 2023-06-19 04:56:00,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=284310.0, ans=0.2 2023-06-19 04:56:12,104 INFO [train.py:996] (0/4) Epoch 2, batch 16900, loss[loss=0.3252, simple_loss=0.3869, pruned_loss=0.1317, over 20665.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3527, pruned_loss=0.1142, over 4273089.85 frames. 
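The frequent "ScheduledFloat: name=..., batch_count=..., ans=..." records print the current value (ans) of module constants that are annealed against the global batch count: dropout rates, skip probabilities, balancer limits, and so on. A plausible minimal version is a piecewise-linear interpolation over (batch_count, value) breakpoints, in the spirit of icefall's scaling.py; the breakpoints below are made up.

    class ScheduledFloat:
        # Piecewise-linear schedule over the global batch count (sketch).
        def __init__(self, *points):
            self.points = sorted(points)      # (batch_count, value) pairs
            self.batch_count = 0.0

        def __float__(self):
            pts = self.points
            if self.batch_count <= pts[0][0]:
                return float(pts[0][1])
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if self.batch_count <= x1:
                    t = (self.batch_count - x0) / (x1 - x0)
                    return float(y0 + t * (y1 - y0))
            return float(pts[-1][1])

    # e.g. a conv_skip_rate that decays to zero over the first 20k batches:
    skip = ScheduledFloat((0.0, 0.5), (20000.0, 0.0))
    skip.batch_count = 283590.0
    print(float(skip))    # -> 0.0, matching records like "ans=0.0" above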
], batch size: 607, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:57:26,612 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:57:45,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=284550.0, ans=0.125 2023-06-19 04:58:04,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=284610.0, ans=0.0 2023-06-19 04:58:33,308 INFO [train.py:996] (0/4) Epoch 2, batch 16950, loss[loss=0.2695, simple_loss=0.3243, pruned_loss=0.1073, over 21910.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3477, pruned_loss=0.1123, over 4271413.72 frames. ], batch size: 351, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 04:58:38,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.82 vs. limit=6.0 2023-06-19 04:58:46,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.811e+02 3.359e+02 4.146e+02 6.829e+02, threshold=6.718e+02, percent-clipped=0.0 2023-06-19 04:59:34,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.19 vs. limit=22.5 2023-06-19 04:59:52,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=284850.0, ans=0.125 2023-06-19 05:00:11,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=284910.0, ans=0.0 2023-06-19 05:00:36,294 INFO [train.py:996] (0/4) Epoch 2, batch 17000, loss[loss=0.3105, simple_loss=0.3478, pruned_loss=0.1366, over 21632.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3463, pruned_loss=0.1129, over 4282604.79 frames. ], batch size: 471, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:01:31,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-19 05:01:36,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=285090.0, ans=0.2 2023-06-19 05:01:49,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-19 05:02:45,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=285210.0, ans=0.0 2023-06-19 05:02:52,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.68 vs. limit=12.0 2023-06-19 05:03:07,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=285210.0, ans=0.125 2023-06-19 05:03:09,671 INFO [train.py:996] (0/4) Epoch 2, batch 17050, loss[loss=0.3415, simple_loss=0.4011, pruned_loss=0.1409, over 21801.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3541, pruned_loss=0.1157, over 4286415.00 frames. 
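The "Whitening: name=..., metric=X vs. limit=Y" records compare a whiteness statistic of a layer's activations with a (possibly scheduled) limit; a corrective gradient is applied only when the metric exceeds the limit, which is why only the offending modules get logged. The sketch below shows one such metric, loosely modeled on icefall's scaling.py: it equals 1.0 when each group's feature covariance is a multiple of the identity and grows as the covariance becomes lopsided. The real implementation may differ in detail.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # x: (..., num_channels); returns a scalar >= 1.0 (sketch).
        x = x.reshape(-1, x.shape[-1])
        num_frames, num_channels = x.shape
        cpg = num_channels // num_groups            # channels per group
        x = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)
        covar = torch.matmul(x.transpose(1, 2), x)  # (num_groups, cpg, cpg)
        mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
        mean_sq = (covar ** 2).sum() / (num_groups * cpg)
        return mean_sq / (mean_diag ** 2 + 1e-20)

    x = torch.randn(1000, 256)
    print(whitening_metric(x))   # close to 1 (plus sampling noise) for white noise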
], batch size: 414, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:03:28,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.955e+02 3.306e+02 4.122e+02 6.323e+02, threshold=6.612e+02, percent-clipped=0.0 2023-06-19 05:03:29,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=285330.0, ans=0.1 2023-06-19 05:04:02,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=285390.0, ans=0.0 2023-06-19 05:04:11,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=285390.0, ans=0.125 2023-06-19 05:04:39,289 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:04:40,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285450.0, ans=0.1 2023-06-19 05:04:46,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=285450.0, ans=0.0 2023-06-19 05:04:46,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=285450.0, ans=0.125 2023-06-19 05:05:09,709 INFO [train.py:996] (0/4) Epoch 2, batch 17100, loss[loss=0.2543, simple_loss=0.3139, pruned_loss=0.09734, over 21443.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3527, pruned_loss=0.1152, over 4292513.56 frames. ], batch size: 211, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:05:32,995 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-06-19 05:05:54,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=285630.0, ans=0.035 2023-06-19 05:07:26,642 INFO [train.py:996] (0/4) Epoch 2, batch 17150, loss[loss=0.2973, simple_loss=0.3382, pruned_loss=0.1282, over 21581.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3488, pruned_loss=0.1149, over 4294753.09 frames. ], batch size: 548, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:07:27,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=8.0 2023-06-19 05:08:02,660 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.782e+02 3.187e+02 4.144e+02 6.035e+02, threshold=6.374e+02, percent-clipped=0.0 2023-06-19 05:08:10,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=285930.0, ans=0.125 2023-06-19 05:08:39,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=285990.0, ans=0.125 2023-06-19 05:08:46,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=286050.0, ans=0.125 2023-06-19 05:08:54,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-06-19 05:09:35,270 INFO [train.py:996] (0/4) Epoch 2, batch 17200, loss[loss=0.3431, simple_loss=0.4398, pruned_loss=0.1232, over 20885.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3488, pruned_loss=0.1147, over 4292600.22 frames. 
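The "WithLoss: name=...self_attn_weights, loss-sum=0.000e+00" records track auxiliary losses attached directly to intermediate tensors such as attention weights; a loss-sum of zero just means the attached regularizer is currently contributing nothing. One way to attach such a loss without changing a module's output is the zero-valued residual sketched below; this is illustrative, not icefall's exact mechanism.

    import torch

    def with_loss(x: torch.Tensor, aux_loss: torch.Tensor) -> torch.Tensor:
        # Numerically returns x unchanged, but backward() now also flows
        # gradient into aux_loss (scaled by the sum of upstream grads).
        # The logged "loss-sum" would be aux_loss accumulated between logs.
        return x + (aux_loss - aux_loss.detach())

    attn = torch.softmax(torch.randn(4, 10, 10, requires_grad=True), dim=-1)
    aux = 0.0 * attn.pow(2).sum()        # a regularizer that is currently zero
    out = with_loss(attn, aux)           # -> "loss-sum=0.000e+00"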
], batch size: 607, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:09:53,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=286170.0, ans=0.125 2023-06-19 05:10:09,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286170.0, ans=0.1 2023-06-19 05:10:11,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=286230.0, ans=0.0 2023-06-19 05:10:16,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=286230.0, ans=0.125 2023-06-19 05:10:41,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=286290.0, ans=0.0 2023-06-19 05:11:01,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-19 05:11:55,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=286410.0, ans=0.125 2023-06-19 05:12:03,955 INFO [train.py:996] (0/4) Epoch 2, batch 17250, loss[loss=0.285, simple_loss=0.3521, pruned_loss=0.1089, over 21803.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3545, pruned_loss=0.1179, over 4292340.61 frames. ], batch size: 282, lr: 1.62e-02, grad_scale: 32.0 2023-06-19 05:12:07,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=286470.0, ans=0.0 2023-06-19 05:12:46,967 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.005e+02 3.499e+02 4.330e+02 7.906e+02, threshold=6.999e+02, percent-clipped=5.0 2023-06-19 05:13:11,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=286590.0, ans=0.0 2023-06-19 05:13:39,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=286650.0, ans=0.125 2023-06-19 05:13:40,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=286650.0, ans=0.2 2023-06-19 05:13:49,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=286650.0, ans=0.0 2023-06-19 05:14:06,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=286710.0, ans=0.125 2023-06-19 05:14:30,721 INFO [train.py:996] (0/4) Epoch 2, batch 17300, loss[loss=0.3046, simple_loss=0.3571, pruned_loss=0.1261, over 21735.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3613, pruned_loss=0.1209, over 4293444.22 frames. ], batch size: 247, lr: 1.62e-02, grad_scale: 16.0 2023-06-19 05:14:33,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.46 vs. 
limit=15.0 2023-06-19 05:15:22,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=286890.0, ans=0.125 2023-06-19 05:15:38,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=286890.0, ans=0.125 2023-06-19 05:16:42,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=287010.0, ans=0.07 2023-06-19 05:17:10,024 INFO [train.py:996] (0/4) Epoch 2, batch 17350, loss[loss=0.2808, simple_loss=0.373, pruned_loss=0.09428, over 21255.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3622, pruned_loss=0.1208, over 4293611.16 frames. ], batch size: 548, lr: 1.62e-02, grad_scale: 16.0 2023-06-19 05:17:31,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 3.184e+02 3.862e+02 4.907e+02 9.344e+02, threshold=7.725e+02, percent-clipped=8.0 2023-06-19 05:18:31,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-19 05:18:42,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=287250.0, ans=0.015 2023-06-19 05:19:10,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5 2023-06-19 05:19:11,902 INFO [train.py:996] (0/4) Epoch 2, batch 17400, loss[loss=0.2564, simple_loss=0.3602, pruned_loss=0.07634, over 19832.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3577, pruned_loss=0.1164, over 4287552.46 frames. ], batch size: 702, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:19:49,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=287430.0, ans=0.0 2023-06-19 05:21:14,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=287610.0, ans=0.0 2023-06-19 05:21:27,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=287610.0, ans=0.125 2023-06-19 05:21:39,830 INFO [train.py:996] (0/4) Epoch 2, batch 17450, loss[loss=0.2228, simple_loss=0.3091, pruned_loss=0.06822, over 21786.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3527, pruned_loss=0.1124, over 4281405.41 frames. ], batch size: 333, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:22:16,014 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.812e+02 3.482e+02 4.322e+02 6.614e+02, threshold=6.964e+02, percent-clipped=0.0 2023-06-19 05:22:47,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=287790.0, ans=0.125 2023-06-19 05:23:54,463 INFO [train.py:996] (0/4) Epoch 2, batch 17500, loss[loss=0.2817, simple_loss=0.3809, pruned_loss=0.0913, over 20805.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3481, pruned_loss=0.1094, over 4287108.30 frames. 
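Many of the scheduled constants above belong to "balancer" modules (balancer1.prob, balancer2.prob, hidden_balancer.prob, plus min_positive/max_positive/min_abs/max_abs limits elsewhere in the log). A balancer nudges per-channel activation statistics into a target range and, to stay cheap, only fires with probability prob on a given batch. The sketch below conveys the idea with an explicit penalty term; icefall's Balancer instead edits gradients in the backward pass, so treat this as a simplified model.

    import torch

    class SimpleBalancer(torch.nn.Module):
        # Penalize channels whose fraction of positive activations leaves
        # [min_positive, max_positive]; runs with probability `prob`.
        def __init__(self, min_positive=0.05, max_positive=0.95, prob=0.125):
            super().__init__()
            self.min_positive = min_positive
            self.max_positive = max_positive
            self.prob = prob
            self.penalty = torch.tensor(0.0)   # add this to the training loss

        def forward(self, x):                  # x: (..., channels)
            if self.training and torch.rand(()) < self.prob:
                flat = x.reshape(-1, x.shape[-1])
                pos = torch.sigmoid(4.0 * flat).mean(dim=0)  # soft positive frac
                self.penalty = ((self.min_positive - pos).clamp(min=0.0)
                                + (pos - self.max_positive).clamp(min=0.0)).sum()
            return x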
], batch size: 608, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:24:00,594 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-48000.pt 2023-06-19 05:24:05,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=287970.0, ans=0.125 2023-06-19 05:25:36,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-19 05:25:56,540 INFO [train.py:996] (0/4) Epoch 2, batch 17550, loss[loss=0.2641, simple_loss=0.3459, pruned_loss=0.09113, over 21773.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3485, pruned_loss=0.108, over 4284654.39 frames. ], batch size: 124, lr: 1.61e-02, grad_scale: 16.0 2023-06-19 05:26:08,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=288270.0, ans=0.2 2023-06-19 05:26:10,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 2.669e+02 3.304e+02 4.133e+02 8.142e+02, threshold=6.607e+02, percent-clipped=6.0 2023-06-19 05:26:20,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=288330.0, ans=0.2 2023-06-19 05:26:22,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=288330.0, ans=0.125 2023-06-19 05:26:30,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=288330.0, ans=0.125 2023-06-19 05:27:54,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=288510.0, ans=0.0 2023-06-19 05:27:57,422 INFO [train.py:996] (0/4) Epoch 2, batch 17600, loss[loss=0.2581, simple_loss=0.3406, pruned_loss=0.08779, over 21589.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3487, pruned_loss=0.1075, over 4269921.48 frames. ], batch size: 112, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:27:57,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=288570.0, ans=0.125 2023-06-19 05:28:07,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=288570.0, ans=0.0 2023-06-19 05:28:08,539 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:29:12,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.73 vs. limit=6.0 2023-06-19 05:29:53,828 INFO [train.py:996] (0/4) Epoch 2, batch 17650, loss[loss=0.2989, simple_loss=0.337, pruned_loss=0.1304, over 20143.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3464, pruned_loss=0.1073, over 4270181.77 frames. 
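The checkpoint.py record at the start of this stretch shows a mid-epoch checkpoint written at a round global batch index (zipformer/exp_L_small/checkpoint-48000.pt). The usual pattern is: save on a fixed batch cadence, store enough state to resume exactly, and prune old files. A generic sketch follows; the cadence and keep-count are placeholders, and the real helper also saves sampler and grad-scaler state.

    from pathlib import Path
    import torch

    def maybe_save_checkpoint(exp_dir, batch_idx_train, model, optimizer,
                              scheduler, save_every_n=4000, keep_last_k=30):
        if batch_idx_train == 0 or batch_idx_train % save_every_n:
            return
        exp_dir = Path(exp_dir)
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "scheduler": scheduler.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            exp_dir / f"checkpoint-{batch_idx_train}.pt",
        )
        ckpts = sorted(exp_dir.glob("checkpoint-*.pt"),
                       key=lambda p: int(p.stem.split("-")[1]))
        for old in ckpts[:-keep_last_k]:   # keep only the newest k
            old.unlink()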
], batch size: 702, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:30:34,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.529e+02 3.160e+02 3.730e+02 7.099e+02, threshold=6.320e+02, percent-clipped=1.0 2023-06-19 05:30:36,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=288930.0, ans=0.125 2023-06-19 05:30:46,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=288930.0, ans=0.125 2023-06-19 05:30:50,104 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0 2023-06-19 05:31:14,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=289050.0, ans=0.125 2023-06-19 05:31:39,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=289050.0, ans=0.125 2023-06-19 05:31:46,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=289110.0, ans=0.125 2023-06-19 05:32:11,539 INFO [train.py:996] (0/4) Epoch 2, batch 17700, loss[loss=0.2866, simple_loss=0.3518, pruned_loss=0.1107, over 21336.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3414, pruned_loss=0.1047, over 4260269.19 frames. ], batch size: 159, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:32:28,551 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=22.5 2023-06-19 05:32:35,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=289170.0, ans=0.0 2023-06-19 05:33:03,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=289230.0, ans=0.04949747468305833 2023-06-19 05:33:25,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=289290.0, ans=0.125 2023-06-19 05:33:46,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=289350.0, ans=0.125 2023-06-19 05:33:48,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=289350.0, ans=0.125 2023-06-19 05:33:57,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=6.0 2023-06-19 05:34:45,065 INFO [train.py:996] (0/4) Epoch 2, batch 17750, loss[loss=0.3662, simple_loss=0.4145, pruned_loss=0.1589, over 21561.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3506, pruned_loss=0.1095, over 4266874.48 frames. ], batch size: 414, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:34:50,542 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.37 vs. 
limit=15.0 2023-06-19 05:35:07,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=289470.0, ans=0.125 2023-06-19 05:35:15,614 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.691e+02 3.269e+02 3.962e+02 7.961e+02, threshold=6.538e+02, percent-clipped=2.0 2023-06-19 05:35:21,233 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.30 vs. limit=22.5 2023-06-19 05:35:34,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=289590.0, ans=0.125 2023-06-19 05:37:04,588 INFO [train.py:996] (0/4) Epoch 2, batch 17800, loss[loss=0.2044, simple_loss=0.2538, pruned_loss=0.07755, over 16940.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3483, pruned_loss=0.1078, over 4257806.23 frames. ], batch size: 60, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:37:54,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.15 vs. limit=10.0 2023-06-19 05:39:34,455 INFO [train.py:996] (0/4) Epoch 2, batch 17850, loss[loss=0.2973, simple_loss=0.3574, pruned_loss=0.1186, over 21979.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3472, pruned_loss=0.108, over 4261315.40 frames. ], batch size: 317, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:39:35,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-19 05:39:57,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=290070.0, ans=0.0 2023-06-19 05:40:01,052 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.782e+02 3.357e+02 4.257e+02 9.205e+02, threshold=6.714e+02, percent-clipped=6.0 2023-06-19 05:40:03,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=290130.0, ans=0.125 2023-06-19 05:40:59,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=290250.0, ans=0.125 2023-06-19 05:41:00,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=290250.0, ans=0.125 2023-06-19 05:41:21,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=290310.0, ans=0.0 2023-06-19 05:41:41,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=290310.0, ans=0.0 2023-06-19 05:41:50,091 INFO [train.py:996] (0/4) Epoch 2, batch 17900, loss[loss=0.3322, simple_loss=0.4042, pruned_loss=0.1301, over 21615.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.353, pruned_loss=0.1111, over 4268831.88 frames. ], batch size: 389, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:42:06,924 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.68 vs. 
limit=22.5 2023-06-19 05:42:13,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=290430.0, ans=0.125 2023-06-19 05:42:34,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=290430.0, ans=0.125 2023-06-19 05:42:52,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=290490.0, ans=0.2 2023-06-19 05:43:09,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.28 vs. limit=22.5 2023-06-19 05:43:42,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=290610.0, ans=0.125 2023-06-19 05:43:59,600 INFO [train.py:996] (0/4) Epoch 2, batch 17950, loss[loss=0.2232, simple_loss=0.3034, pruned_loss=0.07145, over 21421.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3524, pruned_loss=0.1067, over 4259572.29 frames. ], batch size: 211, lr: 1.61e-02, grad_scale: 32.0 2023-06-19 05:44:34,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.675e+02 3.159e+02 3.778e+02 7.100e+02, threshold=6.318e+02, percent-clipped=1.0 2023-06-19 05:44:41,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=290730.0, ans=0.125 2023-06-19 05:44:49,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=290730.0, ans=0.125 2023-06-19 05:45:43,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=290910.0, ans=0.2 2023-06-19 05:46:07,257 INFO [train.py:996] (0/4) Epoch 2, batch 18000, loss[loss=0.2418, simple_loss=0.2972, pruned_loss=0.09322, over 21592.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3455, pruned_loss=0.1051, over 4260726.87 frames. ], batch size: 263, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:46:07,259 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 05:47:00,465 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.0561, 5.1687, 2.4181, 4.5997], device='cuda:0') 2023-06-19 05:47:04,547 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.5081, 3.9526, 3.8950, 3.5075], device='cuda:0') 2023-06-19 05:47:07,985 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2814, simple_loss=0.3799, pruned_loss=0.0915, over 1796401.00 frames. 2023-06-19 05:47:07,987 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-19 05:47:11,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=290970.0, ans=0.0 2023-06-19 05:47:35,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=291030.0, ans=0.125 2023-06-19 05:48:16,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=291150.0, ans=0.125 2023-06-19 05:48:17,673 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.92 vs. 
limit=15.0 2023-06-19 05:48:50,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=291210.0, ans=0.95 2023-06-19 05:48:52,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-19 05:48:54,871 INFO [train.py:996] (0/4) Epoch 2, batch 18050, loss[loss=0.2404, simple_loss=0.3086, pruned_loss=0.08609, over 21225.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.34, pruned_loss=0.1047, over 4262890.74 frames. ], batch size: 176, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:48:55,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=291270.0, ans=0.125 2023-06-19 05:49:19,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=291270.0, ans=0.125 2023-06-19 05:49:26,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.691e+02 3.245e+02 3.925e+02 6.947e+02, threshold=6.490e+02, percent-clipped=2.0 2023-06-19 05:49:29,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.48 vs. limit=22.5 2023-06-19 05:50:26,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=291450.0, ans=0.125 2023-06-19 05:51:13,353 INFO [train.py:996] (0/4) Epoch 2, batch 18100, loss[loss=0.2706, simple_loss=0.3292, pruned_loss=0.106, over 21709.00 frames. ], tot_loss[loss=0.28, simple_loss=0.345, pruned_loss=0.1075, over 4263684.95 frames. ], batch size: 351, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:51:15,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=291570.0, ans=0.2 2023-06-19 05:51:18,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=291570.0, ans=0.0 2023-06-19 05:51:58,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=291630.0, ans=0.2 2023-06-19 05:52:09,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=291690.0, ans=0.04949747468305833 2023-06-19 05:52:53,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=291810.0, ans=0.2 2023-06-19 05:53:14,868 INFO [train.py:996] (0/4) Epoch 2, batch 18150, loss[loss=0.267, simple_loss=0.3306, pruned_loss=0.1017, over 21818.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3447, pruned_loss=0.107, over 4259574.36 frames. ], batch size: 317, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:53:40,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.848e+02 3.405e+02 4.028e+02 7.103e+02, threshold=6.809e+02, percent-clipped=3.0 2023-06-19 05:54:18,260 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:55:01,152 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. 
limit=15.0 2023-06-19 05:55:05,667 INFO [train.py:996] (0/4) Epoch 2, batch 18200, loss[loss=0.2442, simple_loss=0.2963, pruned_loss=0.09605, over 21569.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3401, pruned_loss=0.1074, over 4261314.18 frames. ], batch size: 263, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:55:35,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=292230.0, ans=0.07 2023-06-19 05:55:51,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=292290.0, ans=0.125 2023-06-19 05:57:11,337 INFO [train.py:996] (0/4) Epoch 2, batch 18250, loss[loss=0.2699, simple_loss=0.3287, pruned_loss=0.1056, over 21926.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3303, pruned_loss=0.1028, over 4263727.90 frames. ], batch size: 333, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:57:13,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=292470.0, ans=0.125 2023-06-19 05:57:29,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=292530.0, ans=0.0 2023-06-19 05:57:30,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.559e+02 2.946e+02 3.738e+02 5.936e+02, threshold=5.891e+02, percent-clipped=0.0 2023-06-19 05:58:02,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=292590.0, ans=0.0 2023-06-19 05:59:09,776 INFO [train.py:996] (0/4) Epoch 2, batch 18300, loss[loss=0.3735, simple_loss=0.4456, pruned_loss=0.1506, over 21539.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3297, pruned_loss=0.1032, over 4267749.40 frames. ], batch size: 471, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 05:59:51,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=292830.0, ans=0.125 2023-06-19 06:00:57,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=293010.0, ans=0.0 2023-06-19 06:01:22,874 INFO [train.py:996] (0/4) Epoch 2, batch 18350, loss[loss=0.2621, simple_loss=0.3186, pruned_loss=0.1028, over 21744.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3335, pruned_loss=0.1029, over 4278916.38 frames. ], batch size: 351, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:01:27,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=293070.0, ans=0.125 2023-06-19 06:01:43,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 3.069e+02 3.880e+02 5.006e+02 9.959e+02, threshold=7.760e+02, percent-clipped=14.0 2023-06-19 06:02:20,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=293190.0, ans=0.125 2023-06-19 06:03:19,017 INFO [train.py:996] (0/4) Epoch 2, batch 18400, loss[loss=0.2255, simple_loss=0.3022, pruned_loss=0.07438, over 21716.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.33, pruned_loss=0.1014, over 4274101.73 frames. 
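During the validation pass at batch 18000 a few records back, zipformer.py also dumped attn_weights_entropy tensors, one entropy value per attention head (e.g. tensor([5.0561, 5.1687, 2.4181, 4.5997])). Head entropy is a cheap collapse diagnostic: a head that always attends to a single key scores near 0, a head that attends uniformly over K keys scores log(K). A sketch of the statistic; the exact reduction in zipformer.py may differ.

    import torch

    def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
        # attn: (num_heads, ..., num_keys), rows are attention distributions.
        # Returns mean entropy in nats, one value per head.
        ent = -(attn * (attn + 1e-20).log()).sum(dim=-1)
        return ent.reshape(ent.shape[0], -1).mean(dim=1)

    attn = torch.softmax(torch.randn(4, 32, 100, 100), dim=-1)
    print(attn_weights_entropy(attn))   # near log(100) ~ 4.6 for diffuse heads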
], batch size: 247, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:03:25,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=293370.0, ans=0.2 2023-06-19 06:03:37,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-06-19 06:04:42,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=293490.0, ans=0.125 2023-06-19 06:05:02,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=293550.0, ans=0.0 2023-06-19 06:05:10,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=293550.0, ans=0.125 2023-06-19 06:05:28,485 INFO [train.py:996] (0/4) Epoch 2, batch 18450, loss[loss=0.2064, simple_loss=0.2746, pruned_loss=0.06913, over 21234.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3265, pruned_loss=0.09673, over 4278070.83 frames. ], batch size: 159, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:05:48,547 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.533e+02 3.129e+02 3.878e+02 6.267e+02, threshold=6.259e+02, percent-clipped=0.0 2023-06-19 06:05:51,086 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-19 06:06:20,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=293790.0, ans=0.0 2023-06-19 06:07:14,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=293850.0, ans=0.0 2023-06-19 06:07:14,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=293850.0, ans=0.0 2023-06-19 06:07:34,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=293910.0, ans=0.125 2023-06-19 06:07:36,752 INFO [train.py:996] (0/4) Epoch 2, batch 18500, loss[loss=0.2639, simple_loss=0.3315, pruned_loss=0.09815, over 21488.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.323, pruned_loss=0.09565, over 4275619.17 frames. ], batch size: 389, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:07:48,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=293970.0, ans=0.0 2023-06-19 06:07:58,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=293970.0, ans=0.125 2023-06-19 06:09:02,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=294150.0, ans=0.0 2023-06-19 06:09:20,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-19 06:09:36,611 INFO [train.py:996] (0/4) Epoch 2, batch 18550, loss[loss=0.2629, simple_loss=0.3112, pruned_loss=0.1073, over 21773.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.32, pruned_loss=0.09492, over 4263336.45 frames. 
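grad_scale in the batch records (here mostly 16.0 and 32.0) is the loss-scale of fp16 mixed-precision training: the scaler grows the factor after a long run of overflow-free steps and halves it when a step overflows, so it wanders between powers of two over the course of the log. A generic PyTorch rendering of the same mechanism; the recipe wraps this inside its own training loop.

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_interval=2000)

    def train_step(model, criterion, batch, optimizer):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = criterion(model(batch["inputs"]), batch["targets"])
        scaler.scale(loss).backward()
        scaler.step(optimizer)    # silently skips the update on overflow
        scaler.update()           # grow or shrink the loss scale
        return loss.detach(), scaler.get_scale()   # the "grad_scale" in the log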
], batch size: 351, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:09:49,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=294270.0, ans=0.125 2023-06-19 06:10:04,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.529e+02 3.013e+02 3.541e+02 7.378e+02, threshold=6.027e+02, percent-clipped=2.0 2023-06-19 06:10:53,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=294390.0, ans=0.125 2023-06-19 06:11:06,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=294450.0, ans=0.0 2023-06-19 06:11:06,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294450.0, ans=0.1 2023-06-19 06:11:24,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=294450.0, ans=0.125 2023-06-19 06:11:43,214 INFO [train.py:996] (0/4) Epoch 2, batch 18600, loss[loss=0.2516, simple_loss=0.3226, pruned_loss=0.0903, over 21688.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3185, pruned_loss=0.09577, over 4256048.48 frames. ], batch size: 298, lr: 1.60e-02, grad_scale: 32.0 2023-06-19 06:11:52,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=294570.0, ans=0.125 2023-06-19 06:12:30,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=294630.0, ans=0.0 2023-06-19 06:12:58,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=294690.0, ans=0.5 2023-06-19 06:13:49,184 INFO [train.py:996] (0/4) Epoch 2, batch 18650, loss[loss=0.3164, simple_loss=0.3427, pruned_loss=0.145, over 21365.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3187, pruned_loss=0.09681, over 4264798.11 frames. ], batch size: 473, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:13:51,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=12.0 2023-06-19 06:14:00,870 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:14:14,804 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:14:20,858 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.816e+02 3.268e+02 3.940e+02 5.301e+02, threshold=6.535e+02, percent-clipped=0.0 2023-06-19 06:14:21,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-19 06:14:26,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=294930.0, ans=0.125 2023-06-19 06:15:02,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=294990.0, ans=0.2 2023-06-19 06:15:21,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=15.0 2023-06-19 06:15:29,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=295050.0, ans=10.0 2023-06-19 06:15:30,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=295050.0, ans=0.125 2023-06-19 06:15:54,973 INFO [train.py:996] (0/4) Epoch 2, batch 18700, loss[loss=0.2562, simple_loss=0.311, pruned_loss=0.1007, over 21699.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3172, pruned_loss=0.09843, over 4265358.56 frames. ], batch size: 231, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:16:51,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=295290.0, ans=0.125 2023-06-19 06:17:28,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=295350.0, ans=0.025 2023-06-19 06:18:10,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=295410.0, ans=0.0 2023-06-19 06:18:15,070 INFO [train.py:996] (0/4) Epoch 2, batch 18750, loss[loss=0.3537, simple_loss=0.4092, pruned_loss=0.1491, over 21607.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3213, pruned_loss=0.102, over 4265101.44 frames. ], batch size: 414, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:18:19,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=295470.0, ans=0.0 2023-06-19 06:18:34,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.770e+02 3.195e+02 3.990e+02 6.392e+02, threshold=6.389e+02, percent-clipped=0.0 2023-06-19 06:20:04,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=295710.0, ans=0.0 2023-06-19 06:20:08,758 INFO [train.py:996] (0/4) Epoch 2, batch 18800, loss[loss=0.3041, simple_loss=0.3755, pruned_loss=0.1163, over 21577.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3275, pruned_loss=0.1033, over 4260289.48 frames. ], batch size: 508, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:20:59,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-19 06:21:40,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-19 06:22:26,189 INFO [train.py:996] (0/4) Epoch 2, batch 18850, loss[loss=0.2488, simple_loss=0.3154, pruned_loss=0.09115, over 21538.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3217, pruned_loss=0.09675, over 4257625.66 frames. ], batch size: 441, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:22:52,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.657e+02 2.651e+02 3.232e+02 4.439e+02 7.009e+02, threshold=6.464e+02, percent-clipped=3.0 2023-06-19 06:24:22,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=296310.0, ans=0.1 2023-06-19 06:24:28,384 INFO [train.py:996] (0/4) Epoch 2, batch 18900, loss[loss=0.2918, simple_loss=0.3425, pruned_loss=0.1206, over 15336.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3193, pruned_loss=0.0979, over 4239136.38 frames. 
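The learning rate in these records creeps down very slowly (1.63e-02 at the top of this stretch, 1.59e-02 here) because it is scheduled jointly against the global batch count and the fractional epoch, in the style of icefall's Eden scheduler. The sketch below has that shape; the constants are illustrative defaults, not read out of this section.

    def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.5):
        # Eden-style decay: smooth in both batch count and epoch count.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # Far from the schedule's knees, successive logging intervals change the
    # rate so little that the printed value sticks at e.g. 1.59e-02 for
    # thousands of batches.
    print(eden_lr(0.045, batch=48000, epoch=1.8))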
], batch size: 63, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:24:35,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=296370.0, ans=0.2 2023-06-19 06:24:40,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=296370.0, ans=0.125 2023-06-19 06:24:45,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=296430.0, ans=0.2 2023-06-19 06:25:37,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=296550.0, ans=0.125 2023-06-19 06:25:37,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=296550.0, ans=0.0 2023-06-19 06:26:37,116 INFO [train.py:996] (0/4) Epoch 2, batch 18950, loss[loss=0.2488, simple_loss=0.2984, pruned_loss=0.09963, over 21165.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3199, pruned_loss=0.1005, over 4235732.10 frames. ], batch size: 608, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:26:44,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=296670.0, ans=0.07 2023-06-19 06:27:06,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=296730.0, ans=0.125 2023-06-19 06:27:13,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.916e+02 3.423e+02 4.145e+02 6.065e+02, threshold=6.846e+02, percent-clipped=0.0 2023-06-19 06:28:29,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-19 06:28:57,538 INFO [train.py:996] (0/4) Epoch 2, batch 19000, loss[loss=0.313, simple_loss=0.371, pruned_loss=0.1275, over 21427.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3307, pruned_loss=0.1031, over 4232052.91 frames. ], batch size: 211, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:29:47,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-19 06:29:53,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=297090.0, ans=0.125 2023-06-19 06:29:55,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=297090.0, ans=0.125 2023-06-19 06:29:59,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=297090.0, ans=0.125 2023-06-19 06:31:16,421 INFO [train.py:996] (0/4) Epoch 2, batch 19050, loss[loss=0.2614, simple_loss=0.3202, pruned_loss=0.1013, over 21438.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3355, pruned_loss=0.1072, over 4230070.10 frames. 
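In each train.py record, loss[... over N frames ] describes the current batch while tot_loss[... over M frames ] is a frame-weighted aggregate over recent batches, which is why M sits in the millions and moves slowly. A small sketch of that bookkeeping; the decay constant is illustrative, and the recipe's MetricsTracker does the equivalent with summed counts.

    class LossTracker:
        # Frame-weighted running average of loss components (sketch).
        def __init__(self, decay=0.999):
            self.decay = decay
            self.sums = {}          # name -> decayed sum of loss * frames
            self.frames = 0.0       # decayed sum of frames

        def update(self, frames, **losses):
            self.frames = self.frames * self.decay + frames
            for name, value in losses.items():
                self.sums[name] = (self.sums.get(name, 0.0) * self.decay
                                   + value * frames)

        def averages(self):
            return {k: v / self.frames for k, v in self.sums.items()}

    # e.g. the batch 18900 record above:
    t = LossTracker()
    t.update(15336.0, loss=0.2918, simple_loss=0.3425, pruned_loss=0.1206)
    print(t.averages())   # per-frame averages, as printed in tot_loss[...]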
], batch size: 194, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:31:40,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=297270.0, ans=0.0 2023-06-19 06:31:58,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 3.316e+02 4.197e+02 5.124e+02 7.398e+02, threshold=8.394e+02, percent-clipped=4.0 2023-06-19 06:32:53,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=297450.0, ans=0.0 2023-06-19 06:32:58,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=297450.0, ans=0.0 2023-06-19 06:33:00,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=297510.0, ans=0.0 2023-06-19 06:33:40,894 INFO [train.py:996] (0/4) Epoch 2, batch 19100, loss[loss=0.2749, simple_loss=0.3198, pruned_loss=0.115, over 21593.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3327, pruned_loss=0.1079, over 4243583.54 frames. ], batch size: 414, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:33:47,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=297570.0, ans=0.125 2023-06-19 06:33:56,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-19 06:34:21,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=297690.0, ans=10.0 2023-06-19 06:34:56,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297690.0, ans=0.1 2023-06-19 06:35:48,551 INFO [train.py:996] (0/4) Epoch 2, batch 19150, loss[loss=0.3025, simple_loss=0.3488, pruned_loss=0.1281, over 20093.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3357, pruned_loss=0.1087, over 4254373.55 frames. ], batch size: 707, lr: 1.59e-02, grad_scale: 16.0 2023-06-19 06:36:00,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297870.0, ans=0.1 2023-06-19 06:36:01,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=297870.0, ans=0.0 2023-06-19 06:36:18,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-19 06:36:21,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297930.0, ans=0.1 2023-06-19 06:36:23,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 3.259e+02 3.778e+02 5.445e+02 1.039e+03, threshold=7.556e+02, percent-clipped=5.0 2023-06-19 06:36:26,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.61 vs. 
limit=6.0 2023-06-19 06:37:25,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=298050.0, ans=0.125 2023-06-19 06:37:44,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.02 vs. limit=12.0 2023-06-19 06:38:21,146 INFO [train.py:996] (0/4) Epoch 2, batch 19200, loss[loss=0.2662, simple_loss=0.3542, pruned_loss=0.08908, over 21436.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3478, pruned_loss=0.1103, over 4251682.71 frames. ], batch size: 211, lr: 1.59e-02, grad_scale: 32.0 2023-06-19 06:38:31,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-19 06:38:32,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=298170.0, ans=0.125 2023-06-19 06:39:42,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=298350.0, ans=0.125 2023-06-19 06:40:23,958 INFO [train.py:996] (0/4) Epoch 2, batch 19250, loss[loss=0.2329, simple_loss=0.3109, pruned_loss=0.07748, over 21785.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3458, pruned_loss=0.1036, over 4244061.35 frames. ], batch size: 247, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:40:47,030 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 2.384e+02 2.937e+02 3.389e+02 6.470e+02, threshold=5.874e+02, percent-clipped=0.0 2023-06-19 06:41:45,869 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-19 06:42:30,660 INFO [train.py:996] (0/4) Epoch 2, batch 19300, loss[loss=0.2333, simple_loss=0.2982, pruned_loss=0.08425, over 21738.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3428, pruned_loss=0.1038, over 4256191.35 frames. ], batch size: 247, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:42:34,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=298770.0, ans=0.0 2023-06-19 06:43:34,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=298890.0, ans=0.07 2023-06-19 06:43:36,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=12.0 2023-06-19 06:43:41,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.51 vs. limit=22.5 2023-06-19 06:44:14,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=298950.0, ans=0.0 2023-06-19 06:44:44,506 INFO [train.py:996] (0/4) Epoch 2, batch 19350, loss[loss=0.3073, simple_loss=0.3728, pruned_loss=0.1209, over 21612.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3353, pruned_loss=0.09791, over 4256914.00 frames. 
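Batch sizes in these records swing over an order of magnitude (211 here, 63 and 707 nearby) because batches are assembled by total audio duration rather than by utterance count: buckets of short cuts pack many utterances per batch, buckets of long cuts only a few. A sketch of how such a sampler is typically configured with lhotse; the path and constants are illustrative.

    from lhotse import CutSet
    from lhotse.dataset import DynamicBucketingSampler

    cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # hypothetical path
    sampler = DynamicBucketingSampler(
        cuts,
        max_duration=900.0,   # seconds of audio per batch, not #utterances
        num_buckets=30,       # group cuts of similar length together
        shuffle=True,
    )
    for batch_cuts in sampler:
        # each batch holds ~900 s of audio: hundreds of short cuts or a
        # handful of long ones, hence the wildly varying "batch size"
        break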
], batch size: 473, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:45:28,980 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.566e+02 3.106e+02 3.775e+02 8.572e+02, threshold=6.211e+02, percent-clipped=2.0 2023-06-19 06:47:03,754 INFO [train.py:996] (0/4) Epoch 2, batch 19400, loss[loss=0.2074, simple_loss=0.2888, pruned_loss=0.06295, over 21569.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.333, pruned_loss=0.09637, over 4262314.69 frames. ], batch size: 230, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:47:44,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=299430.0, ans=0.2 2023-06-19 06:49:09,822 INFO [train.py:996] (0/4) Epoch 2, batch 19450, loss[loss=0.2481, simple_loss=0.3082, pruned_loss=0.09406, over 21613.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3312, pruned_loss=0.0994, over 4268802.25 frames. ], batch size: 263, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:49:17,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=299670.0, ans=0.2 2023-06-19 06:49:30,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=299730.0, ans=0.125 2023-06-19 06:49:34,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.192e+02 3.820e+02 4.643e+02 7.190e+02, threshold=7.640e+02, percent-clipped=4.0 2023-06-19 06:50:01,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=299790.0, ans=0.125 2023-06-19 06:50:31,998 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-19 06:50:50,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=299850.0, ans=0.125 2023-06-19 06:51:12,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=299910.0, ans=0.0 2023-06-19 06:51:16,585 INFO [train.py:996] (0/4) Epoch 2, batch 19500, loss[loss=0.24, simple_loss=0.3053, pruned_loss=0.08736, over 21672.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3267, pruned_loss=0.1008, over 4264978.77 frames. ], batch size: 247, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:53:25,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2023-06-19 06:53:29,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300210.0, ans=0.1 2023-06-19 06:53:36,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=300210.0, ans=0.125 2023-06-19 06:53:38,793 INFO [train.py:996] (0/4) Epoch 2, batch 19550, loss[loss=0.2829, simple_loss=0.3582, pruned_loss=0.1038, over 21739.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.322, pruned_loss=0.09837, over 4262718.70 frames. 
], batch size: 414, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:54:26,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=300330.0, ans=0.2 2023-06-19 06:54:27,318 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.773e+02 3.270e+02 3.884e+02 6.073e+02, threshold=6.540e+02, percent-clipped=0.0 2023-06-19 06:55:17,562 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:55:43,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=300510.0, ans=0.125 2023-06-19 06:55:56,407 INFO [train.py:996] (0/4) Epoch 2, batch 19600, loss[loss=0.293, simple_loss=0.3398, pruned_loss=0.1231, over 21434.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3248, pruned_loss=0.09986, over 4271747.82 frames. ], batch size: 211, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:56:28,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300630.0, ans=0.1 2023-06-19 06:56:59,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-19 06:58:20,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. limit=10.0 2023-06-19 06:58:20,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2023-06-19 06:58:20,666 INFO [train.py:996] (0/4) Epoch 2, batch 19650, loss[loss=0.2831, simple_loss=0.3399, pruned_loss=0.1132, over 21827.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3321, pruned_loss=0.1053, over 4274736.90 frames. ], batch size: 298, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 06:58:54,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.810e+02 3.179e+02 3.741e+02 7.713e+02, threshold=6.358e+02, percent-clipped=2.0 2023-06-19 06:59:17,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-19 06:59:32,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=301050.0, ans=0.125 2023-06-19 06:59:56,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=301050.0, ans=0.0 2023-06-19 07:00:52,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=301110.0, ans=0.125 2023-06-19 07:00:55,588 INFO [train.py:996] (0/4) Epoch 2, batch 19700, loss[loss=0.2076, simple_loss=0.2713, pruned_loss=0.07193, over 21387.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3355, pruned_loss=0.1054, over 4276708.70 frames. ], batch size: 131, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:00:57,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=301170.0, ans=0.125 2023-06-19 07:01:03,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.98 vs. 
limit=22.5 2023-06-19 07:01:13,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=301230.0, ans=0.125 2023-06-19 07:02:59,238 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-19 07:03:08,468 INFO [train.py:996] (0/4) Epoch 2, batch 19750, loss[loss=0.2875, simple_loss=0.3557, pruned_loss=0.1096, over 21311.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3464, pruned_loss=0.1077, over 4274191.43 frames. ], batch size: 159, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:03:44,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.569e+02 3.192e+02 3.926e+02 7.719e+02, threshold=6.384e+02, percent-clipped=3.0 2023-06-19 07:04:19,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=301590.0, ans=0.0 2023-06-19 07:05:13,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=301710.0, ans=0.125 2023-06-19 07:05:27,407 INFO [train.py:996] (0/4) Epoch 2, batch 19800, loss[loss=0.288, simple_loss=0.3466, pruned_loss=0.1147, over 21857.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3452, pruned_loss=0.1087, over 4280536.40 frames. ], batch size: 414, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:05:57,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=301770.0, ans=0.125 2023-06-19 07:06:03,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.90 vs. limit=10.0 2023-06-19 07:06:18,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=301830.0, ans=0.125 2023-06-19 07:06:43,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=301890.0, ans=0.125 2023-06-19 07:06:50,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=301890.0, ans=0.125 2023-06-19 07:06:57,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=301950.0, ans=0.0 2023-06-19 07:07:42,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.05 vs. limit=15.0 2023-06-19 07:07:55,601 INFO [train.py:996] (0/4) Epoch 2, batch 19850, loss[loss=0.2446, simple_loss=0.3311, pruned_loss=0.07905, over 21242.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3375, pruned_loss=0.1025, over 4275588.03 frames. ], batch size: 548, lr: 1.58e-02, grad_scale: 32.0 2023-06-19 07:08:23,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=302070.0, ans=0.2 2023-06-19 07:08:29,053 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.580e+02 3.192e+02 4.086e+02 8.227e+02, threshold=6.384e+02, percent-clipped=3.0 2023-06-19 07:08:44,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.08 vs. 
limit=15.0 2023-06-19 07:09:26,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=302250.0, ans=0.025 2023-06-19 07:09:40,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=302250.0, ans=0.0 2023-06-19 07:09:47,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=302310.0, ans=0.05 2023-06-19 07:10:15,736 INFO [train.py:996] (0/4) Epoch 2, batch 19900, loss[loss=0.2545, simple_loss=0.3351, pruned_loss=0.08695, over 21610.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3382, pruned_loss=0.1001, over 4268936.58 frames. ], batch size: 263, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:10:37,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=302430.0, ans=0.5 2023-06-19 07:10:54,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=302430.0, ans=0.125 2023-06-19 07:12:13,341 INFO [train.py:996] (0/4) Epoch 2, batch 19950, loss[loss=0.2365, simple_loss=0.2972, pruned_loss=0.08789, over 21392.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3307, pruned_loss=0.09984, over 4269638.66 frames. ], batch size: 144, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:12:23,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=302670.0, ans=0.125 2023-06-19 07:12:32,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-06-19 07:12:46,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.592e+02 3.325e+02 4.122e+02 6.437e+02, threshold=6.651e+02, percent-clipped=1.0 2023-06-19 07:12:47,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.54 vs. limit=15.0 2023-06-19 07:13:20,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=302790.0, ans=0.125 2023-06-19 07:13:27,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-19 07:13:39,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=302850.0, ans=0.2 2023-06-19 07:14:23,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=302910.0, ans=0.0 2023-06-19 07:14:28,314 INFO [train.py:996] (0/4) Epoch 2, batch 20000, loss[loss=0.2709, simple_loss=0.3308, pruned_loss=0.1055, over 21849.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3308, pruned_loss=0.09963, over 4274689.32 frames. 
], batch size: 124, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:15:18,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=303030.0, ans=0.035 2023-06-19 07:15:26,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=303030.0, ans=0.125 2023-06-19 07:15:51,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=303090.0, ans=0.125 2023-06-19 07:16:09,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=303150.0, ans=0.125 2023-06-19 07:16:36,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=22.5 2023-06-19 07:16:47,202 INFO [train.py:996] (0/4) Epoch 2, batch 20050, loss[loss=0.2873, simple_loss=0.3432, pruned_loss=0.1157, over 21925.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3334, pruned_loss=0.1023, over 4285952.91 frames. ], batch size: 333, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:17:17,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 3.033e+02 3.423e+02 3.985e+02 8.117e+02, threshold=6.846e+02, percent-clipped=3.0 2023-06-19 07:18:24,996 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.67 vs. limit=22.5 2023-06-19 07:18:35,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-19 07:18:36,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=303510.0, ans=0.125 2023-06-19 07:19:11,620 INFO [train.py:996] (0/4) Epoch 2, batch 20100, loss[loss=0.2885, simple_loss=0.3506, pruned_loss=0.1132, over 19779.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3362, pruned_loss=0.1057, over 4292426.52 frames. ], batch size: 704, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:19:21,463 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-19 07:19:59,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=303630.0, ans=0.0 2023-06-19 07:20:16,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=303690.0, ans=0.0 2023-06-19 07:20:19,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=303690.0, ans=0.035 2023-06-19 07:20:29,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5 2023-06-19 07:21:08,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=303810.0, ans=0.125 2023-06-19 07:21:56,490 INFO [train.py:996] (0/4) Epoch 2, batch 20150, loss[loss=0.2961, simple_loss=0.3604, pruned_loss=0.1159, over 21754.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3465, pruned_loss=0.1103, over 4291033.04 frames. 
], batch size: 298, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:22:26,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.980e+02 4.036e+02 4.841e+02 1.073e+03, threshold=8.072e+02, percent-clipped=4.0 2023-06-19 07:22:38,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=303990.0, ans=0.07 2023-06-19 07:23:21,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=304050.0, ans=0.0 2023-06-19 07:23:57,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=304110.0, ans=0.125 2023-06-19 07:24:04,561 INFO [train.py:996] (0/4) Epoch 2, batch 20200, loss[loss=0.272, simple_loss=0.339, pruned_loss=0.1025, over 21628.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3535, pruned_loss=0.1137, over 4280867.85 frames. ], batch size: 263, lr: 1.57e-02, grad_scale: 16.0 2023-06-19 07:24:10,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=304170.0, ans=0.125 2023-06-19 07:24:45,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=304230.0, ans=0.0 2023-06-19 07:24:46,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.72 vs. limit=22.5 2023-06-19 07:24:51,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=304230.0, ans=0.125 2023-06-19 07:25:01,352 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.29 vs. limit=15.0 2023-06-19 07:25:15,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=304290.0, ans=0.0 2023-06-19 07:25:46,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=304350.0, ans=0.125 2023-06-19 07:26:30,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.94 vs. limit=10.0 2023-06-19 07:26:31,276 INFO [train.py:996] (0/4) Epoch 2, batch 20250, loss[loss=0.3239, simple_loss=0.3683, pruned_loss=0.1398, over 21623.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3524, pruned_loss=0.1114, over 4279618.78 frames. ], batch size: 507, lr: 1.57e-02, grad_scale: 16.0 2023-06-19 07:26:55,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.816e+02 3.333e+02 3.978e+02 6.194e+02, threshold=6.665e+02, percent-clipped=0.0 2023-06-19 07:27:01,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=304530.0, ans=0.1 2023-06-19 07:27:11,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-06-19 07:28:02,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. limit=15.0 2023-06-19 07:28:36,320 INFO [train.py:996] (0/4) Epoch 2, batch 20300, loss[loss=0.2835, simple_loss=0.3659, pruned_loss=0.1005, over 21183.00 frames. 
], tot_loss[loss=0.2847, simple_loss=0.3515, pruned_loss=0.1089, over 4274013.19 frames. ], batch size: 548, lr: 1.57e-02, grad_scale: 16.0 2023-06-19 07:28:44,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=304770.0, ans=0.1 2023-06-19 07:29:00,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=304830.0, ans=0.125 2023-06-19 07:29:22,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=304890.0, ans=0.0 2023-06-19 07:29:37,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=304890.0, ans=0.125 2023-06-19 07:30:17,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=305010.0, ans=0.0 2023-06-19 07:30:34,158 INFO [train.py:996] (0/4) Epoch 2, batch 20350, loss[loss=0.3408, simple_loss=0.3838, pruned_loss=0.149, over 21800.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3505, pruned_loss=0.1089, over 4275627.71 frames. ], batch size: 441, lr: 1.57e-02, grad_scale: 16.0 2023-06-19 07:30:39,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=305070.0, ans=0.125 2023-06-19 07:30:59,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=305130.0, ans=0.125 2023-06-19 07:31:00,705 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.745e+02 3.216e+02 3.932e+02 7.808e+02, threshold=6.432e+02, percent-clipped=2.0 2023-06-19 07:32:01,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-19 07:32:38,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-19 07:32:39,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=305310.0, ans=0.07 2023-06-19 07:32:48,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=305310.0, ans=0.0 2023-06-19 07:32:55,100 INFO [train.py:996] (0/4) Epoch 2, batch 20400, loss[loss=0.3106, simple_loss=0.3691, pruned_loss=0.126, over 21946.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3535, pruned_loss=0.1118, over 4266641.31 frames. ], batch size: 316, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:33:03,586 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.95 vs. 
limit=22.5 2023-06-19 07:33:05,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=305370.0, ans=0.125 2023-06-19 07:33:34,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=305430.0, ans=0.125 2023-06-19 07:33:45,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=305490.0, ans=0.0 2023-06-19 07:34:12,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=305490.0, ans=0.2 2023-06-19 07:34:19,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=305550.0, ans=0.125 2023-06-19 07:34:22,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=305550.0, ans=0.125 2023-06-19 07:34:57,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=305610.0, ans=0.125 2023-06-19 07:35:00,590 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-19 07:35:00,799 INFO [train.py:996] (0/4) Epoch 2, batch 20450, loss[loss=0.2921, simple_loss=0.3465, pruned_loss=0.1189, over 21791.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3543, pruned_loss=0.1143, over 4267690.99 frames. ], batch size: 247, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:35:34,026 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.893e+02 2.841e+02 3.405e+02 4.322e+02 6.691e+02, threshold=6.810e+02, percent-clipped=2.0 2023-06-19 07:36:10,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=305850.0, ans=0.1 2023-06-19 07:37:12,741 INFO [train.py:996] (0/4) Epoch 2, batch 20500, loss[loss=0.2866, simple_loss=0.3448, pruned_loss=0.1142, over 21402.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3516, pruned_loss=0.1153, over 4259707.01 frames. ], batch size: 131, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:37:15,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=305970.0, ans=0.125 2023-06-19 07:37:31,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=306030.0, ans=0.125 2023-06-19 07:38:50,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-19 07:39:27,699 INFO [train.py:996] (0/4) Epoch 2, batch 20550, loss[loss=0.2449, simple_loss=0.3211, pruned_loss=0.08439, over 21442.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3438, pruned_loss=0.1124, over 4258089.90 frames. 
], batch size: 194, lr: 1.57e-02, grad_scale: 32.0 2023-06-19 07:39:29,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306270.0, ans=0.1 2023-06-19 07:40:03,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.682e+02 3.144e+02 3.592e+02 6.172e+02, threshold=6.288e+02, percent-clipped=0.0 2023-06-19 07:40:07,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=306330.0, ans=0.0 2023-06-19 07:40:20,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=306330.0, ans=0.05 2023-06-19 07:41:14,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306450.0, ans=0.1 2023-06-19 07:41:39,992 INFO [train.py:996] (0/4) Epoch 2, batch 20600, loss[loss=0.2341, simple_loss=0.3044, pruned_loss=0.08189, over 16475.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3448, pruned_loss=0.1105, over 4256046.55 frames. ], batch size: 61, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:43:25,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=306750.0, ans=22.5 2023-06-19 07:43:37,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=306810.0, ans=0.0 2023-06-19 07:43:48,062 INFO [train.py:996] (0/4) Epoch 2, batch 20650, loss[loss=0.2337, simple_loss=0.291, pruned_loss=0.08823, over 21503.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3401, pruned_loss=0.1106, over 4269185.48 frames. ], batch size: 195, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:43:58,311 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.07 vs. limit=15.0 2023-06-19 07:43:59,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=306870.0, ans=0.125 2023-06-19 07:44:18,680 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.691e+02 3.207e+02 4.278e+02 6.062e+02, threshold=6.414e+02, percent-clipped=0.0 2023-06-19 07:44:28,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=306930.0, ans=0.125 2023-06-19 07:45:12,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=307050.0, ans=0.0 2023-06-19 07:45:26,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=307050.0, ans=0.09899494936611666 2023-06-19 07:45:53,885 INFO [train.py:996] (0/4) Epoch 2, batch 20700, loss[loss=0.3399, simple_loss=0.4029, pruned_loss=0.1384, over 21490.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3333, pruned_loss=0.1064, over 4269062.89 frames. ], batch size: 471, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:45:55,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=307170.0, ans=0.2 2023-06-19 07:45:59,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. 
limit=10.0 2023-06-19 07:46:34,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307230.0, ans=0.1 2023-06-19 07:46:43,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=307230.0, ans=0.2 2023-06-19 07:48:09,369 INFO [train.py:996] (0/4) Epoch 2, batch 20750, loss[loss=0.3254, simple_loss=0.4101, pruned_loss=0.1203, over 21759.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3359, pruned_loss=0.1064, over 4244943.38 frames. ], batch size: 332, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:48:19,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=307470.0, ans=0.125 2023-06-19 07:48:27,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=307470.0, ans=0.0 2023-06-19 07:48:43,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-19 07:48:44,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.969e+02 3.611e+02 4.710e+02 7.755e+02, threshold=7.221e+02, percent-clipped=2.0 2023-06-19 07:49:00,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=307530.0, ans=0.125 2023-06-19 07:49:53,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.98 vs. limit=6.0 2023-06-19 07:50:11,258 INFO [train.py:996] (0/4) Epoch 2, batch 20800, loss[loss=0.3302, simple_loss=0.3533, pruned_loss=0.1536, over 21323.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3376, pruned_loss=0.107, over 4244948.93 frames. ], batch size: 507, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:50:22,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=307770.0, ans=0.125 2023-06-19 07:51:10,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=307890.0, ans=0.2 2023-06-19 07:51:25,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-06-19 07:52:34,130 INFO [train.py:996] (0/4) Epoch 2, batch 20850, loss[loss=0.2217, simple_loss=0.2809, pruned_loss=0.08128, over 16673.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3303, pruned_loss=0.1043, over 4232489.37 frames. ], batch size: 61, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:53:03,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.789e+02 3.231e+02 4.140e+02 1.099e+03, threshold=6.461e+02, percent-clipped=5.0 2023-06-19 07:53:32,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=308190.0, ans=0.125 2023-06-19 07:54:03,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.53 vs. 
limit=22.5 2023-06-19 07:54:34,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=308310.0, ans=0.04949747468305833 2023-06-19 07:54:35,124 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=22.5 2023-06-19 07:54:37,011 INFO [train.py:996] (0/4) Epoch 2, batch 20900, loss[loss=0.254, simple_loss=0.318, pruned_loss=0.09501, over 21165.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3306, pruned_loss=0.1052, over 4246671.97 frames. ], batch size: 159, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:54:39,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=12.0 2023-06-19 07:55:15,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=308430.0, ans=0.125 2023-06-19 07:55:46,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=308490.0, ans=0.125 2023-06-19 07:56:33,637 INFO [train.py:996] (0/4) Epoch 2, batch 20950, loss[loss=0.2141, simple_loss=0.2815, pruned_loss=0.07342, over 21693.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3253, pruned_loss=0.1004, over 4250648.21 frames. ], batch size: 298, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:56:36,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=308670.0, ans=0.0 2023-06-19 07:56:56,392 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 2.542e+02 3.174e+02 4.032e+02 6.054e+02, threshold=6.348e+02, percent-clipped=0.0 2023-06-19 07:56:57,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.47 vs. limit=6.0 2023-06-19 07:57:31,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308790.0, ans=0.1 2023-06-19 07:57:31,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308790.0, ans=0.1 2023-06-19 07:57:31,769 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-19 07:58:07,890 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-19 07:58:34,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=308910.0, ans=0.0 2023-06-19 07:58:38,194 INFO [train.py:996] (0/4) Epoch 2, batch 21000, loss[loss=0.254, simple_loss=0.3113, pruned_loss=0.09832, over 21817.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3241, pruned_loss=0.1008, over 4255211.36 frames. ], batch size: 282, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 07:58:38,195 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 07:59:32,902 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2892, simple_loss=0.3858, pruned_loss=0.09632, over 1796401.00 frames. 
2023-06-19 07:59:32,904 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-19 07:59:44,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=308970.0, ans=0.125 2023-06-19 07:59:51,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=309030.0, ans=0.125 2023-06-19 07:59:56,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=309030.0, ans=0.0 2023-06-19 08:00:20,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=309090.0, ans=0.0 2023-06-19 08:00:33,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=309150.0, ans=0.125 2023-06-19 08:00:35,310 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:01:14,062 INFO [train.py:996] (0/4) Epoch 2, batch 21050, loss[loss=0.2531, simple_loss=0.314, pruned_loss=0.09604, over 21890.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3235, pruned_loss=0.1017, over 4260993.19 frames. ], batch size: 107, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 08:01:45,262 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.810e+02 3.247e+02 3.992e+02 5.990e+02, threshold=6.494e+02, percent-clipped=0.0 2023-06-19 08:03:08,458 INFO [train.py:996] (0/4) Epoch 2, batch 21100, loss[loss=0.2399, simple_loss=0.2851, pruned_loss=0.0973, over 21580.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3201, pruned_loss=0.1018, over 4257865.81 frames. ], batch size: 231, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 08:03:53,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=309630.0, ans=0.125 2023-06-19 08:03:54,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=309630.0, ans=0.2 2023-06-19 08:04:36,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=309690.0, ans=0.2 2023-06-19 08:05:00,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.12 vs. limit=6.0 2023-06-19 08:05:14,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=309810.0, ans=0.125 2023-06-19 08:05:17,515 INFO [train.py:996] (0/4) Epoch 2, batch 21150, loss[loss=0.2755, simple_loss=0.3111, pruned_loss=0.1199, over 21692.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3153, pruned_loss=0.1013, over 4257956.71 frames. 
], batch size: 417, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 08:05:18,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=309870.0, ans=0.04949747468305833 2023-06-19 08:05:39,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=309930.0, ans=0.2 2023-06-19 08:05:40,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.684e+02 3.280e+02 4.252e+02 7.142e+02, threshold=6.560e+02, percent-clipped=1.0 2023-06-19 08:07:12,908 INFO [train.py:996] (0/4) Epoch 2, batch 21200, loss[loss=0.2796, simple_loss=0.3188, pruned_loss=0.1202, over 21423.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3113, pruned_loss=0.1002, over 4246223.81 frames. ], batch size: 508, lr: 1.56e-02, grad_scale: 32.0 2023-06-19 08:07:24,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-19 08:07:24,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-06-19 08:08:03,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=310290.0, ans=0.2 2023-06-19 08:08:05,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=310290.0, ans=0.125 2023-06-19 08:09:05,316 INFO [train.py:996] (0/4) Epoch 2, batch 21250, loss[loss=0.2364, simple_loss=0.2954, pruned_loss=0.08866, over 21811.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3103, pruned_loss=0.1006, over 4252373.38 frames. ], batch size: 112, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:09:07,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=310470.0, ans=0.125 2023-06-19 08:09:25,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=22.5 2023-06-19 08:09:28,393 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.442e+02 2.877e+02 3.403e+02 5.738e+02, threshold=5.754e+02, percent-clipped=0.0 2023-06-19 08:09:55,668 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-19 08:10:51,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=310710.0, ans=0.0 2023-06-19 08:10:51,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=310710.0, ans=0.2 2023-06-19 08:11:05,116 INFO [train.py:996] (0/4) Epoch 2, batch 21300, loss[loss=0.2846, simple_loss=0.3546, pruned_loss=0.1072, over 21788.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3188, pruned_loss=0.1036, over 4246363.78 frames. ], batch size: 351, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:11:57,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.78 vs. limit=22.5 2023-06-19 08:11:59,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. 
limit=15.0 2023-06-19 08:12:23,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-19 08:12:44,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=310950.0, ans=0.125 2023-06-19 08:13:29,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-19 08:13:31,859 INFO [train.py:996] (0/4) Epoch 2, batch 21350, loss[loss=0.2729, simple_loss=0.3531, pruned_loss=0.0963, over 21817.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3242, pruned_loss=0.1048, over 4251674.25 frames. ], batch size: 351, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:13:32,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=311070.0, ans=0.125 2023-06-19 08:14:12,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.747e+02 3.368e+02 4.316e+02 7.083e+02, threshold=6.735e+02, percent-clipped=3.0 2023-06-19 08:15:22,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-19 08:15:32,896 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=22.5 2023-06-19 08:15:39,917 INFO [train.py:996] (0/4) Epoch 2, batch 21400, loss[loss=0.2925, simple_loss=0.3539, pruned_loss=0.1156, over 21385.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3284, pruned_loss=0.1043, over 4258949.31 frames. ], batch size: 159, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:17:26,785 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:17:28,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311550.0, ans=0.1 2023-06-19 08:17:32,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=311610.0, ans=0.1 2023-06-19 08:17:54,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-19 08:18:01,522 INFO [train.py:996] (0/4) Epoch 2, batch 21450, loss[loss=0.2956, simple_loss=0.3519, pruned_loss=0.1196, over 21814.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.333, pruned_loss=0.1071, over 4265891.96 frames. ], batch size: 124, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:18:13,681 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.11 vs. 
limit=15.0 2023-06-19 08:18:35,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.772e+02 3.350e+02 3.887e+02 8.399e+02, threshold=6.699e+02, percent-clipped=2.0 2023-06-19 08:19:18,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=311790.0, ans=0.0 2023-06-19 08:19:19,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=311790.0, ans=0.125 2023-06-19 08:19:21,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=311790.0, ans=0.0 2023-06-19 08:19:24,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=311850.0, ans=0.0 2023-06-19 08:19:27,578 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.61 vs. limit=12.0 2023-06-19 08:19:43,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=311910.0, ans=0.0 2023-06-19 08:20:14,689 INFO [train.py:996] (0/4) Epoch 2, batch 21500, loss[loss=0.2632, simple_loss=0.3096, pruned_loss=0.1084, over 21243.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3313, pruned_loss=0.108, over 4262597.68 frames. ], batch size: 608, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:20:18,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=311970.0, ans=0.125 2023-06-19 08:20:19,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=311970.0, ans=0.125 2023-06-19 08:20:27,683 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-52000.pt 2023-06-19 08:21:27,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=312150.0, ans=0.125 2023-06-19 08:22:17,593 INFO [train.py:996] (0/4) Epoch 2, batch 21550, loss[loss=0.215, simple_loss=0.2829, pruned_loss=0.07356, over 21661.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3223, pruned_loss=0.1037, over 4260361.24 frames. ], batch size: 415, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:22:45,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.454e+02 2.906e+02 3.459e+02 5.516e+02, threshold=5.812e+02, percent-clipped=0.0 2023-06-19 08:23:42,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=312450.0, ans=0.2 2023-06-19 08:23:51,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=312450.0, ans=0.125 2023-06-19 08:24:33,359 INFO [train.py:996] (0/4) Epoch 2, batch 21600, loss[loss=0.2374, simple_loss=0.3194, pruned_loss=0.07773, over 21551.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3167, pruned_loss=0.1012, over 4265544.50 frames. ], batch size: 263, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:25:48,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=312750.0, ans=0.1 2023-06-19 08:26:08,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.76 vs. 
limit=12.0 2023-06-19 08:26:33,276 INFO [train.py:996] (0/4) Epoch 2, batch 21650, loss[loss=0.2628, simple_loss=0.3723, pruned_loss=0.07667, over 20834.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3208, pruned_loss=0.09875, over 4264975.51 frames. ], batch size: 607, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:26:50,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.980e+02 3.690e+02 4.543e+02 8.571e+02, threshold=7.379e+02, percent-clipped=9.0 2023-06-19 08:27:28,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-19 08:27:39,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=313050.0, ans=0.07 2023-06-19 08:27:43,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=313050.0, ans=0.2 2023-06-19 08:28:19,140 INFO [train.py:996] (0/4) Epoch 2, batch 21700, loss[loss=0.2373, simple_loss=0.2966, pruned_loss=0.08895, over 21392.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.321, pruned_loss=0.09557, over 4269218.71 frames. ], batch size: 211, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:28:58,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=313230.0, ans=0.125 2023-06-19 08:29:00,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=313230.0, ans=0.125 2023-06-19 08:30:16,212 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-19 08:30:16,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.49 vs. limit=22.5 2023-06-19 08:30:23,471 INFO [train.py:996] (0/4) Epoch 2, batch 21750, loss[loss=0.2538, simple_loss=0.2951, pruned_loss=0.1063, over 21249.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3172, pruned_loss=0.0965, over 4264598.66 frames. ], batch size: 176, lr: 1.55e-02, grad_scale: 32.0 2023-06-19 08:30:25,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=313470.0, ans=0.125 2023-06-19 08:30:25,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=313470.0, ans=0.125 2023-06-19 08:30:47,365 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.599e+02 3.130e+02 4.451e+02 8.277e+02, threshold=6.259e+02, percent-clipped=1.0 2023-06-19 08:31:17,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=313590.0, ans=0.1 2023-06-19 08:31:21,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=313590.0, ans=0.0 2023-06-19 08:31:34,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=313650.0, ans=0.0 2023-06-19 08:32:34,372 INFO [train.py:996] (0/4) Epoch 2, batch 21800, loss[loss=0.2769, simple_loss=0.3469, pruned_loss=0.1034, over 21629.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3156, pruned_loss=0.09824, over 4273531.29 frames. 
], batch size: 391, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:32:46,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313770.0, ans=0.1
2023-06-19 08:34:01,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.82 vs. limit=15.0
2023-06-19 08:34:21,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=314070.0, ans=0.125
2023-06-19 08:34:22,212 INFO [train.py:996] (0/4) Epoch 2, batch 21850, loss[loss=0.2449, simple_loss=0.3329, pruned_loss=0.07845, over 21378.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3225, pruned_loss=0.0993, over 4270981.07 frames. ], batch size: 211, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:34:35,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=314070.0, ans=0.125
2023-06-19 08:34:53,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=314130.0, ans=0.125
2023-06-19 08:34:56,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.670e+02 3.094e+02 3.698e+02 5.413e+02, threshold=6.187e+02, percent-clipped=0.0
2023-06-19 08:34:59,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5
2023-06-19 08:35:45,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=314250.0, ans=0.0
2023-06-19 08:35:47,435 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 08:36:01,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=314310.0, ans=0.1
2023-06-19 08:36:02,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=314310.0, ans=0.0
2023-06-19 08:36:37,152 INFO [train.py:996] (0/4) Epoch 2, batch 21900, loss[loss=0.2528, simple_loss=0.3042, pruned_loss=0.1007, over 21714.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3239, pruned_loss=0.1013, over 4276509.35 frames. ], batch size: 333, lr: 1.55e-02, grad_scale: 32.0
2023-06-19 08:36:48,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=314370.0, ans=0.1
2023-06-19 08:37:05,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=314430.0, ans=0.125
2023-06-19 08:38:10,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=314610.0, ans=0.125
2023-06-19 08:38:11,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=314610.0, ans=0.1
2023-06-19 08:38:12,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0
2023-06-19 08:38:33,250 INFO [train.py:996] (0/4) Epoch 2, batch 21950, loss[loss=0.2598, simple_loss=0.2985, pruned_loss=0.1106, over 20993.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3188, pruned_loss=0.09976, over 4276260.43 frames. ], batch size: 607, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:38:35,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=314670.0, ans=0.125
2023-06-19 08:38:37,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=314670.0, ans=0.125
2023-06-19 08:39:07,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.758e+02 3.297e+02 3.878e+02 5.596e+02, threshold=6.593e+02, percent-clipped=0.0
2023-06-19 08:39:54,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=314850.0, ans=0.125
2023-06-19 08:40:16,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=314910.0, ans=0.09899494936611666
2023-06-19 08:40:39,712 INFO [train.py:996] (0/4) Epoch 2, batch 22000, loss[loss=0.2383, simple_loss=0.2999, pruned_loss=0.08835, over 21436.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3113, pruned_loss=0.09553, over 4258395.43 frames. ], batch size: 131, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:41:02,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=315030.0, ans=0.025
2023-06-19 08:41:53,293 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0
2023-06-19 08:42:08,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=315150.0, ans=0.125
2023-06-19 08:42:13,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=315150.0, ans=0.125
2023-06-19 08:42:51,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=315270.0, ans=0.125
2023-06-19 08:42:51,922 INFO [train.py:996] (0/4) Epoch 2, batch 22050, loss[loss=0.1887, simple_loss=0.2756, pruned_loss=0.05092, over 21660.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3182, pruned_loss=0.09826, over 4263470.43 frames. ], batch size: 298, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:43:21,680 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.619e+02 3.176e+02 4.335e+02 6.749e+02, threshold=6.352e+02, percent-clipped=1.0
2023-06-19 08:43:27,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=315330.0, ans=0.125
2023-06-19 08:44:51,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=315510.0, ans=0.2
2023-06-19 08:44:52,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=315510.0, ans=0.2
2023-06-19 08:45:05,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=315570.0, ans=0.2
2023-06-19 08:45:06,816 INFO [train.py:996] (0/4) Epoch 2, batch 22100, loss[loss=0.2896, simple_loss=0.3505, pruned_loss=0.1144, over 21272.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3304, pruned_loss=0.1047, over 4260808.93 frames. ], batch size: 143, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:45:33,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=315630.0, ans=0.04949747468305833
2023-06-19 08:45:42,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=315690.0, ans=0.0
2023-06-19 08:46:21,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=315750.0, ans=0.09899494936611666
2023-06-19 08:46:23,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=315750.0, ans=0.125
2023-06-19 08:46:43,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=315810.0, ans=0.2
2023-06-19 08:47:01,478 INFO [train.py:996] (0/4) Epoch 2, batch 22150, loss[loss=0.2672, simple_loss=0.3334, pruned_loss=0.1005, over 21448.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3332, pruned_loss=0.1067, over 4269030.24 frames. ], batch size: 211, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:47:23,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 3.220e+02 3.673e+02 4.139e+02 7.886e+02, threshold=7.346e+02, percent-clipped=1.0
2023-06-19 08:47:35,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=315930.0, ans=0.1
2023-06-19 08:48:13,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=315990.0, ans=0.125
2023-06-19 08:49:12,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=316110.0, ans=0.2
2023-06-19 08:49:13,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=316110.0, ans=0.125
2023-06-19 08:49:16,314 INFO [train.py:996] (0/4) Epoch 2, batch 22200, loss[loss=0.2639, simple_loss=0.3274, pruned_loss=0.1002, over 21423.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.334, pruned_loss=0.1078, over 4277850.22 frames. ], batch size: 177, lr: 1.54e-02, grad_scale: 64.0
2023-06-19 08:49:28,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=316170.0, ans=0.125
2023-06-19 08:49:41,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=316230.0, ans=0.0
2023-06-19 08:50:53,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=316350.0, ans=0.125
2023-06-19 08:51:06,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=316410.0, ans=0.0
2023-06-19 08:51:22,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316410.0, ans=0.1
2023-06-19 08:51:24,968 INFO [train.py:996] (0/4) Epoch 2, batch 22250, loss[loss=0.335, simple_loss=0.3976, pruned_loss=0.1362, over 21885.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3436, pruned_loss=0.1103, over 4271110.82 frames. ], batch size: 124, lr: 1.54e-02, grad_scale: 64.0
2023-06-19 08:51:58,717 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.811e+02 3.425e+02 4.060e+02 7.172e+02, threshold=6.851e+02, percent-clipped=0.0
2023-06-19 08:52:44,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=316650.0, ans=0.2
2023-06-19 08:53:14,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=316710.0, ans=0.0
2023-06-19 08:53:19,878 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0
2023-06-19 08:53:34,327 INFO [train.py:996] (0/4) Epoch 2, batch 22300, loss[loss=0.2973, simple_loss=0.3447, pruned_loss=0.125, over 21813.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.3458, pruned_loss=0.1126, over 4279693.51 frames. ], batch size: 441, lr: 1.54e-02, grad_scale: 64.0
2023-06-19 08:53:36,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=316770.0, ans=0.0
2023-06-19 08:54:08,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=316770.0, ans=0.0
2023-06-19 08:55:20,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0
2023-06-19 08:55:36,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=317010.0, ans=0.2
2023-06-19 08:55:46,284 INFO [train.py:996] (0/4) Epoch 2, batch 22350, loss[loss=0.2689, simple_loss=0.3261, pruned_loss=0.1059, over 21859.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3436, pruned_loss=0.1128, over 4282355.22 frames. ], batch size: 298, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:55:58,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=317070.0, ans=0.125
2023-06-19 08:56:05,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=22.5
2023-06-19 08:56:31,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=317130.0, ans=0.0
2023-06-19 08:56:32,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.798e+02 3.468e+02 4.054e+02 6.110e+02, threshold=6.936e+02, percent-clipped=0.0
2023-06-19 08:56:33,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=317130.0, ans=0.125
2023-06-19 08:57:01,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.61 vs. limit=15.0
2023-06-19 08:57:57,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=317310.0, ans=0.125
2023-06-19 08:58:18,577 INFO [train.py:996] (0/4) Epoch 2, batch 22400, loss[loss=0.2345, simple_loss=0.3019, pruned_loss=0.08358, over 21569.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3387, pruned_loss=0.1077, over 4285733.38 frames. ], batch size: 263, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 08:59:37,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0
2023-06-19 08:59:41,330 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 09:00:19,684 INFO [train.py:996] (0/4) Epoch 2, batch 22450, loss[loss=0.2219, simple_loss=0.2787, pruned_loss=0.08256, over 21547.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3323, pruned_loss=0.107, over 4276619.42 frames. ], batch size: 231, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 09:00:20,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0
2023-06-19 09:00:50,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.620e+02 3.213e+02 3.522e+02 5.684e+02, threshold=6.426e+02, percent-clipped=0.0
2023-06-19 09:01:11,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=317730.0, ans=0.0
2023-06-19 09:01:59,776 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0
2023-06-19 09:02:24,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=317910.0, ans=0.04949747468305833
2023-06-19 09:02:29,737 INFO [train.py:996] (0/4) Epoch 2, batch 22500, loss[loss=0.2747, simple_loss=0.3377, pruned_loss=0.1058, over 21243.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3261, pruned_loss=0.1057, over 4272253.77 frames. ], batch size: 176, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 09:02:48,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=317970.0, ans=0.125
2023-06-19 09:02:50,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=317970.0, ans=0.125
2023-06-19 09:03:50,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=318090.0, ans=0.125
2023-06-19 09:03:51,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0
2023-06-19 09:04:36,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=318210.0, ans=0.2
2023-06-19 09:04:47,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=318210.0, ans=0.0
2023-06-19 09:04:50,918 INFO [train.py:996] (0/4) Epoch 2, batch 22550, loss[loss=0.3564, simple_loss=0.4078, pruned_loss=0.1525, over 21543.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3299, pruned_loss=0.1063, over 4271926.45 frames. ], batch size: 471, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 09:05:26,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.743e+02 3.376e+02 4.420e+02 1.013e+03, threshold=6.752e+02, percent-clipped=6.0
2023-06-19 09:07:02,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=318510.0, ans=0.0
2023-06-19 09:07:02,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5
2023-06-19 09:07:08,861 INFO [train.py:996] (0/4) Epoch 2, batch 22600, loss[loss=0.2097, simple_loss=0.2619, pruned_loss=0.07876, over 21172.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3326, pruned_loss=0.1069, over 4274590.98 frames. ], batch size: 143, lr: 1.54e-02, grad_scale: 32.0
2023-06-19 09:07:40,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=318570.0, ans=0.95
2023-06-19 09:09:05,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=318810.0, ans=0.0
2023-06-19 09:09:22,036 INFO [train.py:996] (0/4) Epoch 2, batch 22650, loss[loss=0.221, simple_loss=0.2791, pruned_loss=0.08142, over 21425.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3295, pruned_loss=0.1053, over 4270815.68 frames. ], batch size: 131, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:09:57,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.722e+02 3.113e+02 3.879e+02 6.383e+02, threshold=6.225e+02, percent-clipped=0.0
2023-06-19 09:10:33,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0
2023-06-19 09:11:02,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=319110.0, ans=0.0
2023-06-19 09:11:09,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=319110.0, ans=0.125
2023-06-19 09:11:29,474 INFO [train.py:996] (0/4) Epoch 2, batch 22700, loss[loss=0.2407, simple_loss=0.2921, pruned_loss=0.09458, over 21557.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.324, pruned_loss=0.1047, over 4265940.03 frames. ], batch size: 247, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:11:46,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=319170.0, ans=0.125
2023-06-19 09:12:29,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=319290.0, ans=0.125
2023-06-19 09:13:36,628 INFO [train.py:996] (0/4) Epoch 2, batch 22750, loss[loss=0.3206, simple_loss=0.3727, pruned_loss=0.1342, over 21141.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3244, pruned_loss=0.1064, over 4266358.32 frames. ], batch size: 143, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:13:40,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.11 vs. limit=10.0
2023-06-19 09:13:52,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=319470.0, ans=0.0
2023-06-19 09:14:05,222 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 09:14:26,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.878e+02 3.333e+02 3.990e+02 8.279e+02, threshold=6.666e+02, percent-clipped=1.0
2023-06-19 09:14:55,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=319650.0, ans=0.05
2023-06-19 09:15:24,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=319650.0, ans=0.125
2023-06-19 09:15:59,087 INFO [train.py:996] (0/4) Epoch 2, batch 22800, loss[loss=0.2964, simple_loss=0.3418, pruned_loss=0.1255, over 21564.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3311, pruned_loss=0.1099, over 4263175.88 frames. ], batch size: 548, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:16:25,760 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 09:17:09,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=319890.0, ans=0.1
2023-06-19 09:18:06,539 INFO [train.py:996] (0/4) Epoch 2, batch 22850, loss[loss=0.2367, simple_loss=0.2897, pruned_loss=0.09187, over 21650.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.328, pruned_loss=0.1085, over 4262400.48 frames. ], batch size: 282, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:18:11,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=320070.0, ans=0.2
2023-06-19 09:18:43,484 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 3.116e+02 3.903e+02 5.067e+02 7.447e+02, threshold=7.805e+02, percent-clipped=3.0
2023-06-19 09:20:02,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=320310.0, ans=0.125
2023-06-19 09:20:03,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=320310.0, ans=0.035
2023-06-19 09:20:32,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=320370.0, ans=0.2
2023-06-19 09:20:33,842 INFO [train.py:996] (0/4) Epoch 2, batch 22900, loss[loss=0.2451, simple_loss=0.3046, pruned_loss=0.09278, over 21638.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3276, pruned_loss=0.1069, over 4268037.56 frames. ], batch size: 247, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:20:38,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=320370.0, ans=0.125
2023-06-19 09:21:05,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=320430.0, ans=0.0
2023-06-19 09:21:39,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=320490.0, ans=0.0
2023-06-19 09:22:01,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=320550.0, ans=0.125
2023-06-19 09:22:08,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=320550.0, ans=0.125
2023-06-19 09:22:34,823 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0
2023-06-19 09:23:04,008 INFO [train.py:996] (0/4) Epoch 2, batch 22950, loss[loss=0.3024, simple_loss=0.4231, pruned_loss=0.09087, over 21320.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3394, pruned_loss=0.1054, over 4274144.99 frames. ], batch size: 548, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:23:26,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=320730.0, ans=0.0
2023-06-19 09:23:29,082 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 3.028e+02 3.874e+02 4.799e+02 7.839e+02, threshold=7.748e+02, percent-clipped=1.0
2023-06-19 09:23:52,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=320790.0, ans=0.0
2023-06-19 09:23:56,794 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 09:24:57,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=320910.0, ans=0.2
2023-06-19 09:25:08,585 INFO [train.py:996] (0/4) Epoch 2, batch 23000, loss[loss=0.3118, simple_loss=0.3607, pruned_loss=0.1315, over 21848.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.339, pruned_loss=0.1027, over 4284774.06 frames. ], batch size: 414, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:26:13,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=321090.0, ans=0.0
2023-06-19 09:27:36,520 INFO [train.py:996] (0/4) Epoch 2, batch 23050, loss[loss=0.3173, simple_loss=0.3699, pruned_loss=0.1324, over 21236.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3407, pruned_loss=0.1055, over 4288056.80 frames. ], batch size: 143, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:27:43,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0
2023-06-19 09:27:44,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=321270.0, ans=0.0
2023-06-19 09:27:55,682 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.939e+02 3.561e+02 4.565e+02 8.100e+02, threshold=7.122e+02, percent-clipped=1.0
2023-06-19 09:28:43,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0
2023-06-19 09:29:35,910 INFO [train.py:996] (0/4) Epoch 2, batch 23100, loss[loss=0.235, simple_loss=0.2869, pruned_loss=0.09154, over 21595.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3363, pruned_loss=0.1061, over 4280936.29 frames. ], batch size: 231, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:29:37,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=321570.0, ans=0.125
2023-06-19 09:29:59,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=321630.0, ans=0.0
2023-06-19 09:30:16,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=321630.0, ans=0.0
2023-06-19 09:30:17,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321630.0, ans=0.1
2023-06-19 09:31:57,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=321810.0, ans=0.125
2023-06-19 09:32:02,681 INFO [train.py:996] (0/4) Epoch 2, batch 23150, loss[loss=0.3065, simple_loss=0.3444, pruned_loss=0.1343, over 21587.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3295, pruned_loss=0.1043, over 4269276.92 frames. ], batch size: 471, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:32:21,382 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.749e+02 3.350e+02 4.019e+02 8.048e+02, threshold=6.700e+02, percent-clipped=1.0
2023-06-19 09:32:37,243 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5
2023-06-19 09:33:22,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=322050.0, ans=0.125
2023-06-19 09:34:08,025 INFO [train.py:996] (0/4) Epoch 2, batch 23200, loss[loss=0.2577, simple_loss=0.3143, pruned_loss=0.1005, over 21377.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3314, pruned_loss=0.1069, over 4279500.86 frames. ], batch size: 176, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:34:09,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=322170.0, ans=0.0
2023-06-19 09:34:11,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=322170.0, ans=0.07
2023-06-19 09:34:12,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=322170.0, ans=0.125
2023-06-19 09:34:16,385 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.68 vs. limit=6.0
2023-06-19 09:35:08,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322290.0, ans=0.1
2023-06-19 09:35:17,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=322350.0, ans=0.125
2023-06-19 09:36:20,040 INFO [train.py:996] (0/4) Epoch 2, batch 23250, loss[loss=0.307, simple_loss=0.3573, pruned_loss=0.1283, over 21901.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3316, pruned_loss=0.1083, over 4291398.30 frames. ], batch size: 414, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:36:20,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=322470.0, ans=0.07
2023-06-19 09:36:37,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322470.0, ans=0.1
2023-06-19 09:37:00,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.833e+02 3.294e+02 4.025e+02 6.311e+02, threshold=6.588e+02, percent-clipped=0.0
2023-06-19 09:37:59,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=322650.0, ans=0.07
2023-06-19 09:38:55,479 INFO [train.py:996] (0/4) Epoch 2, batch 23300, loss[loss=0.3366, simple_loss=0.4001, pruned_loss=0.1366, over 21719.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3404, pruned_loss=0.1103, over 4292877.65 frames. ], batch size: 441, lr: 1.53e-02, grad_scale: 32.0
2023-06-19 09:39:53,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=322890.0, ans=0.0
2023-06-19 09:40:11,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=322890.0, ans=0.125
2023-06-19 09:40:37,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=322950.0, ans=0.0
2023-06-19 09:40:54,653 INFO [train.py:996] (0/4) Epoch 2, batch 23350, loss[loss=0.213, simple_loss=0.2762, pruned_loss=0.07487, over 21206.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3438, pruned_loss=0.1088, over 4287914.85 frames. ], batch size: 143, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 09:41:38,032 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.727e+02 3.213e+02 3.768e+02 7.532e+02, threshold=6.426e+02, percent-clipped=1.0
2023-06-19 09:42:05,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=323190.0, ans=0.07
2023-06-19 09:43:21,908 INFO [train.py:996] (0/4) Epoch 2, batch 23400, loss[loss=0.2369, simple_loss=0.3003, pruned_loss=0.08675, over 21316.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3364, pruned_loss=0.1037, over 4293707.64 frames. ], batch size: 159, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 09:44:05,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0
2023-06-19 09:44:43,233 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 09:45:40,849 INFO [train.py:996] (0/4) Epoch 2, batch 23450, loss[loss=0.3236, simple_loss=0.3745, pruned_loss=0.1364, over 21326.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3362, pruned_loss=0.1063, over 4294442.05 frames. ], batch size: 176, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 09:46:22,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.895e+02 3.507e+02 4.209e+02 6.661e+02, threshold=7.013e+02, percent-clipped=1.0
2023-06-19 09:46:54,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=323790.0, ans=0.02
2023-06-19 09:48:05,215 INFO [train.py:996] (0/4) Epoch 2, batch 23500, loss[loss=0.2863, simple_loss=0.3878, pruned_loss=0.09241, over 19869.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3385, pruned_loss=0.1082, over 4286616.19 frames. ], batch size: 702, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 09:50:01,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=324210.0, ans=22.5
2023-06-19 09:50:09,934 INFO [train.py:996] (0/4) Epoch 2, batch 23550, loss[loss=0.2686, simple_loss=0.3207, pruned_loss=0.1083, over 21872.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3341, pruned_loss=0.1082, over 4279571.05 frames. ], batch size: 107, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 09:50:14,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=324270.0, ans=0.1
2023-06-19 09:50:32,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.771e+02 3.237e+02 3.882e+02 7.021e+02, threshold=6.473e+02, percent-clipped=1.0
2023-06-19 09:50:43,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0
2023-06-19 09:51:18,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324450.0, ans=0.1
2023-06-19 09:51:19,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=324450.0, ans=0.125
2023-06-19 09:51:51,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=324510.0, ans=0.0
2023-06-19 09:52:12,580 INFO [train.py:996] (0/4) Epoch 2, batch 23600, loss[loss=0.2716, simple_loss=0.3326, pruned_loss=0.1053, over 21712.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3337, pruned_loss=0.1089, over 4280629.81 frames. ], batch size: 298, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 09:52:20,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=324570.0, ans=0.125
2023-06-19 09:53:17,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.96 vs. limit=22.5
2023-06-19 09:53:18,024 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.08 vs. limit=22.5
2023-06-19 09:53:23,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=324690.0, ans=0.125
2023-06-19 09:54:27,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=324810.0, ans=0.1
2023-06-19 09:54:36,391 INFO [train.py:996] (0/4) Epoch 2, batch 23650, loss[loss=0.24, simple_loss=0.3128, pruned_loss=0.08364, over 21628.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3352, pruned_loss=0.1081, over 4282839.00 frames. ], batch size: 230, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 09:54:38,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=324870.0, ans=0.0
2023-06-19 09:55:30,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.894e+02 3.663e+02 5.031e+02 1.050e+03, threshold=7.326e+02, percent-clipped=9.0
2023-06-19 09:55:39,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=324990.0, ans=0.2
2023-06-19 09:55:40,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.15 vs. limit=6.0
2023-06-19 09:55:41,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=324990.0, ans=0.125
2023-06-19 09:55:44,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=324990.0, ans=0.0
2023-06-19 09:56:00,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=15.0
2023-06-19 09:56:22,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=325050.0, ans=0.125
2023-06-19 09:56:26,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=325050.0, ans=0.2
2023-06-19 09:57:18,211 INFO [train.py:996] (0/4) Epoch 2, batch 23700, loss[loss=0.3043, simple_loss=0.3577, pruned_loss=0.1254, over 21278.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3393, pruned_loss=0.1081, over 4282997.68 frames. ], batch size: 143, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 09:57:36,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=325170.0, ans=0.125
2023-06-19 09:58:41,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=325350.0, ans=0.125
2023-06-19 09:59:11,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=325410.0, ans=0.125
2023-06-19 09:59:30,271 INFO [train.py:996] (0/4) Epoch 2, batch 23750, loss[loss=0.2435, simple_loss=0.3169, pruned_loss=0.08506, over 21302.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3401, pruned_loss=0.108, over 4282815.49 frames. ], batch size: 176, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 09:59:40,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=325470.0, ans=0.2
2023-06-19 10:00:11,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=325530.0, ans=0.125
2023-06-19 10:00:13,667 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.910e+02 3.446e+02 4.097e+02 6.175e+02, threshold=6.892e+02, percent-clipped=0.0
2023-06-19 10:00:34,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=325590.0, ans=0.125
2023-06-19 10:00:35,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=325590.0, ans=0.0
2023-06-19 10:01:01,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0
2023-06-19 10:01:35,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=325710.0, ans=10.0
2023-06-19 10:01:52,695 INFO [train.py:996] (0/4) Epoch 2, batch 23800, loss[loss=0.3427, simple_loss=0.416, pruned_loss=0.1347, over 21635.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3391, pruned_loss=0.1058, over 4285232.64 frames. ], batch size: 414, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 10:01:59,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=325770.0, ans=0.0
2023-06-19 10:02:14,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.55 vs. limit=15.0
2023-06-19 10:02:29,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=325830.0, ans=0.125
2023-06-19 10:02:33,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0
2023-06-19 10:02:34,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=325890.0, ans=0.0
2023-06-19 10:03:35,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=325950.0, ans=0.125
2023-06-19 10:03:37,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=326010.0, ans=0.125
2023-06-19 10:03:58,554 INFO [train.py:996] (0/4) Epoch 2, batch 23850, loss[loss=0.3192, simple_loss=0.3708, pruned_loss=0.1338, over 21528.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3502, pruned_loss=0.1088, over 4278922.08 frames. ], batch size: 194, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 10:04:40,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=326130.0, ans=0.1
2023-06-19 10:04:41,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 2.956e+02 3.514e+02 4.313e+02 7.177e+02, threshold=7.028e+02, percent-clipped=1.0
2023-06-19 10:05:02,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=326190.0, ans=0.1
2023-06-19 10:05:40,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=326250.0, ans=0.0
2023-06-19 10:05:42,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=326250.0, ans=0.0
2023-06-19 10:05:54,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=326250.0, ans=0.125
2023-06-19 10:06:20,020 INFO [train.py:996] (0/4) Epoch 2, batch 23900, loss[loss=0.296, simple_loss=0.3546, pruned_loss=0.1187, over 21504.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3568, pruned_loss=0.1111, over 4280652.33 frames. ], batch size: 389, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 10:06:23,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=326370.0, ans=0.125
2023-06-19 10:06:23,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=326370.0, ans=0.2
2023-06-19 10:08:01,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0
2023-06-19 10:08:19,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0
2023-06-19 10:08:27,711 INFO [train.py:996] (0/4) Epoch 2, batch 23950, loss[loss=0.297, simple_loss=0.3462, pruned_loss=0.124, over 21883.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.351, pruned_loss=0.1115, over 4280073.79 frames. ], batch size: 372, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 10:09:19,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.788e+02 3.140e+02 3.660e+02 5.549e+02, threshold=6.280e+02, percent-clipped=0.0
2023-06-19 10:10:45,250 INFO [train.py:996] (0/4) Epoch 2, batch 24000, loss[loss=0.3438, simple_loss=0.3961, pruned_loss=0.1457, over 21596.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3519, pruned_loss=0.1147, over 4282898.66 frames. ], batch size: 415, lr: 1.52e-02, grad_scale: 32.0
2023-06-19 10:10:45,252 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-19 10:11:36,135 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2838, simple_loss=0.3817, pruned_loss=0.09297, over 1796401.00 frames.
2023-06-19 10:11:36,135 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB
2023-06-19 10:11:40,582 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0
2023-06-19 10:12:11,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327030.0, ans=0.1
2023-06-19 10:12:11,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=327030.0, ans=0.2
2023-06-19 10:12:23,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327090.0, ans=0.1
2023-06-19 10:12:31,590 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0
2023-06-19 10:13:39,999 INFO [train.py:996] (0/4) Epoch 2, batch 24050, loss[loss=0.2661, simple_loss=0.3481, pruned_loss=0.09206, over 21650.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3538, pruned_loss=0.1152, over 4287281.43 frames. ], batch size: 389, lr: 1.51e-02, grad_scale: 32.0
2023-06-19 10:13:40,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=327270.0, ans=0.125
2023-06-19 10:14:14,498 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 2.893e+02 3.313e+02 4.221e+02 6.916e+02, threshold=6.626e+02, percent-clipped=2.0
2023-06-19 10:14:24,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327330.0, ans=0.1
2023-06-19 10:15:17,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=327450.0, ans=0.125
2023-06-19 10:15:18,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=327450.0, ans=0.0
2023-06-19 10:15:58,855 INFO [train.py:996] (0/4) Epoch 2, batch 24100, loss[loss=0.3232, simple_loss=0.3873, pruned_loss=0.1295, over 21672.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3529, pruned_loss=0.1132, over 4282488.90 frames. ], batch size: 389, lr: 1.51e-02, grad_scale: 32.0
2023-06-19 10:16:24,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=327630.0, ans=0.05
2023-06-19 10:16:38,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327630.0, ans=0.1
2023-06-19 10:18:09,450 INFO [train.py:996] (0/4) Epoch 2, batch 24150, loss[loss=0.2788, simple_loss=0.3317, pruned_loss=0.113, over 21268.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3512, pruned_loss=0.1142, over 4287956.91 frames. ], batch size: 176, lr: 1.51e-02, grad_scale: 32.0
2023-06-19 10:18:12,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=327870.0, ans=0.125
2023-06-19 10:18:48,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 2.793e+02 3.020e+02 3.796e+02 7.586e+02, threshold=6.040e+02, percent-clipped=1.0
2023-06-19 10:18:56,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.89 vs. limit=6.0
2023-06-19 10:19:13,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=15.0
2023-06-19 10:19:48,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=328050.0, ans=0.125
2023-06-19 10:19:52,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=328110.0, ans=0.0
2023-06-19 10:20:39,211 INFO [train.py:996] (0/4) Epoch 2, batch 24200, loss[loss=0.2526, simple_loss=0.3246, pruned_loss=0.09034, over 21663.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3559, pruned_loss=0.1172, over 4290622.76 frames. ], batch size: 247, lr: 1.51e-02, grad_scale: 16.0
2023-06-19 10:21:30,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=328290.0, ans=0.125
2023-06-19 10:22:58,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=328470.0, ans=0.125
2023-06-19 10:22:59,857 INFO [train.py:996] (0/4) Epoch 2, batch 24250, loss[loss=0.2227, simple_loss=0.3063, pruned_loss=0.06951, over 21641.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3519, pruned_loss=0.1093, over 4282985.33 frames. ], batch size: 230, lr: 1.51e-02, grad_scale: 16.0
2023-06-19 10:23:01,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.56 vs. limit=15.0
2023-06-19 10:23:04,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=328470.0, ans=0.125
2023-06-19 10:23:42,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.96 vs. limit=15.0
2023-06-19 10:23:44,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.606e+02 2.962e+02 3.428e+02 4.906e+02, threshold=5.923e+02, percent-clipped=0.0
2023-06-19 10:23:51,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=328590.0, ans=0.125
2023-06-19 10:24:26,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=328650.0, ans=0.07
2023-06-19 10:24:31,147 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=22.5
2023-06-19 10:25:23,882 INFO [train.py:996] (0/4) Epoch 2, batch 24300, loss[loss=0.2161, simple_loss=0.2959, pruned_loss=0.06818, over 21788.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.343, pruned_loss=0.1018, over 4274949.05 frames. ], batch size: 332, lr: 1.51e-02, grad_scale: 16.0
2023-06-19 10:26:07,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=328830.0, ans=0.09899494936611666
2023-06-19 10:26:40,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=328950.0, ans=0.0
2023-06-19 10:26:40,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=328950.0, ans=0.1
2023-06-19 10:27:07,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=328950.0, ans=10.0
2023-06-19 10:27:36,334 INFO [train.py:996] (0/4) Epoch 2, batch 24350, loss[loss=0.2916, simple_loss=0.3459, pruned_loss=0.1187, over 21424.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3382, pruned_loss=0.1016, over 4282783.59 frames. ], batch size: 211, lr: 1.51e-02, grad_scale: 16.0
2023-06-19 10:27:38,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=329070.0, ans=0.125
2023-06-19 10:27:55,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=329130.0, ans=0.05
2023-06-19 10:28:15,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.020e+02 2.750e+02 3.153e+02 3.822e+02 5.866e+02, threshold=6.306e+02, percent-clipped=0.0
2023-06-19 10:28:18,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=329130.0, ans=0.05
2023-06-19 10:28:37,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=329190.0, ans=0.125
2023-06-19 10:28:38,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=329190.0, ans=0.125
2023-06-19 10:28:55,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=329190.0, ans=0.0
2023-06-19 10:29:45,925 INFO [train.py:996] (0/4) Epoch 2, batch 24400, loss[loss=0.2885, simple_loss=0.3403, pruned_loss=0.1183, over 21869.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3449, pruned_loss=0.1068, over 4282059.06 frames. ], batch size: 107, lr: 1.51e-02, grad_scale: 32.0
2023-06-19 10:30:12,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=329370.0, ans=0.0
2023-06-19 10:30:46,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=329430.0, ans=0.125
2023-06-19 10:30:52,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=12.0
2023-06-19 10:31:22,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=329550.0, ans=0.0
2023-06-19 10:31:55,278 INFO [train.py:996] (0/4) Epoch 2, batch 24450, loss[loss=0.3813, simple_loss=0.4527, pruned_loss=0.1549, over 21699.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.347, pruned_loss=0.1083, over 4277049.28 frames. ], batch size: 414, lr: 1.51e-02, grad_scale: 32.0
2023-06-19 10:32:36,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.674e+02 3.123e+02 3.615e+02 7.708e+02, threshold=6.247e+02, percent-clipped=3.0
2023-06-19 10:32:36,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=329730.0, ans=0.125
2023-06-19 10:33:48,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0
2023-06-19 10:33:55,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=329910.0, ans=0.0
2023-06-19 10:34:09,714 INFO [train.py:996] (0/4) Epoch 2, batch 24500, loss[loss=0.2487, simple_loss=0.304, pruned_loss=0.09675, over 21180.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3485, pruned_loss=0.1079, over 4285051.30 frames. ], batch size: 607, lr: 1.51e-02, grad_scale: 32.0
2023-06-19 10:34:26,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329970.0, ans=0.1
2023-06-19 10:34:30,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=329970.0, ans=0.02
2023-06-19 10:35:06,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=330090.0, ans=0.125
2023-06-19 10:35:42,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=330150.0, ans=0.2
2023-06-19 10:35:55,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=330150.0, ans=0.125
2023-06-19 10:36:35,175 INFO [train.py:996] (0/4) Epoch 2, batch 24550, loss[loss=0.2737, simple_loss=0.338, pruned_loss=0.1047, over 21551.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3515, pruned_loss=0.1111, over 4288134.51 frames. ], batch size: 112, lr: 1.51e-02, grad_scale: 32.0
2023-06-19 10:36:36,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=330270.0, ans=0.125
2023-06-19 10:36:38,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.94 vs. limit=10.0
2023-06-19 10:36:47,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=330270.0, ans=0.125
2023-06-19 10:37:10,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=330330.0, ans=10.0
2023-06-19 10:37:13,331 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.779e+02 3.400e+02 4.076e+02 5.829e+02, threshold=6.799e+02, percent-clipped=0.0
2023-06-19 10:37:15,422 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 10:38:43,070 INFO [train.py:996] (0/4) Epoch 2, batch 24600, loss[loss=0.23, simple_loss=0.287, pruned_loss=0.08646, over 21899.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3456, pruned_loss=0.1114, over 4289818.46 frames. ], batch size: 113, lr: 1.51e-02, grad_scale: 32.0
2023-06-19 10:39:31,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0
2023-06-19 10:39:32,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=330630.0, ans=0.1
2023-06-19 10:40:46,682 INFO [train.py:996] (0/4) Epoch 2, batch 24650, loss[loss=0.2713, simple_loss=0.3343, pruned_loss=0.1041, over 21278.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3351, pruned_loss=0.1086, over 4289696.19 frames. ], batch size: 549, lr: 1.51e-02, grad_scale: 32.0
2023-06-19 10:41:26,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=330930.0, ans=0.125
2023-06-19 10:41:27,288 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.833e+02 3.466e+02 4.292e+02 5.874e+02, threshold=6.932e+02, percent-clipped=0.0
2023-06-19 10:42:14,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=331050.0, ans=0.125
2023-06-19 10:42:31,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=331050.0, ans=0.025
2023-06-19 10:43:00,296 INFO [train.py:996] (0/4) Epoch 2, batch 24700, loss[loss=0.2424, simple_loss=0.304, pruned_loss=0.09039, over 21598.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3331, pruned_loss=0.1067, over 4279246.33 frames. ], batch size: 247, lr: 1.51e-02, grad_scale: 32.0
2023-06-19 10:43:34,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=331230.0, ans=0.0
2023-06-19 10:43:34,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=331230.0, ans=0.125
2023-06-19 10:43:56,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.85 vs. limit=22.5
2023-06-19 10:44:39,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=331410.0, ans=0.1
2023-06-19 10:45:02,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=331470.0, ans=0.125
2023-06-19 10:45:02,944 INFO [train.py:996] (0/4) Epoch 2, batch 24750, loss[loss=0.2184, simple_loss=0.2664, pruned_loss=0.08522, over 21272.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3253, pruned_loss=0.1027, over 4280772.63 frames. ], batch size: 551, lr: 1.51e-02, grad_scale: 32.0
2023-06-19 10:45:08,261 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0
2023-06-19 10:45:40,271 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.432e+02 2.874e+02 3.533e+02 8.125e+02, threshold=5.749e+02, percent-clipped=3.0
2023-06-19 10:47:10,094 INFO [train.py:996] (0/4) Epoch 2, batch 24800, loss[loss=0.2774, simple_loss=0.3245, pruned_loss=0.1152, over 21802.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3214, pruned_loss=0.1023, over 4284853.55 frames. ], batch size: 282, lr: 1.50e-02, grad_scale: 32.0
2023-06-19 10:47:22,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=331770.0, ans=0.0
2023-06-19 10:48:16,761 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 10:48:18,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=331890.0, ans=0.2
2023-06-19 10:49:33,421 INFO [train.py:996] (0/4) Epoch 2, batch 24850, loss[loss=0.308, simple_loss=0.3635, pruned_loss=0.1262, over 21734.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3236, pruned_loss=0.105, over 4294464.44 frames. ], batch size: 389, lr: 1.50e-02, grad_scale: 32.0
2023-06-19 10:50:13,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.964e+02 3.527e+02 4.123e+02 6.284e+02, threshold=7.055e+02, percent-clipped=5.0
2023-06-19 10:50:26,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=332190.0, ans=0.125
2023-06-19 10:50:54,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=332250.0, ans=0.0
2023-06-19 10:51:18,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=332250.0, ans=0.0
2023-06-19 10:51:21,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=332250.0, ans=0.125
2023-06-19 10:51:25,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0
2023-06-19 10:51:55,285 INFO [train.py:996] (0/4) Epoch 2, batch 24900, loss[loss=0.3541, simple_loss=0.3932, pruned_loss=0.1575, over 21764.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.33, pruned_loss=0.1083, over 4292483.56 frames. ], batch size: 441, lr: 1.50e-02, grad_scale: 32.0
2023-06-19 10:52:49,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332430.0, ans=0.1
2023-06-19 10:53:22,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=332490.0, ans=0.2
2023-06-19 10:53:27,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=332550.0, ans=0.125
2023-06-19 10:53:46,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=332610.0, ans=0.125
2023-06-19 10:53:48,231 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 10:53:48,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=12.0
2023-06-19 10:54:13,020 INFO [train.py:996] (0/4) Epoch 2, batch 24950, loss[loss=0.2581, simple_loss=0.293, pruned_loss=0.1116, over 20171.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3374, pruned_loss=0.1127, over 4285685.22 frames. ], batch size: 703, lr: 1.50e-02, grad_scale: 32.0
2023-06-19 10:54:43,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332730.0, ans=0.1
2023-06-19 10:54:53,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=332730.0, ans=0.025
2023-06-19 10:54:54,099 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.083e+02 3.929e+02 5.227e+02 8.953e+02, threshold=7.858e+02, percent-clipped=7.0
2023-06-19 10:56:37,136 INFO [train.py:996] (0/4) Epoch 2, batch 25000, loss[loss=0.3293, simple_loss=0.3682, pruned_loss=0.1452, over 21338.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3426, pruned_loss=0.114, over 4284888.54 frames. ], batch size: 471, lr: 1.50e-02, grad_scale: 32.0
2023-06-19 10:56:37,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=332970.0, ans=0.125
2023-06-19 10:57:30,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=333090.0, ans=10.0
2023-06-19 10:57:34,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5
2023-06-19 10:57:41,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333150.0, ans=0.1
2023-06-19 10:57:56,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=333150.0, ans=0.125
2023-06-19 10:58:04,574 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5
2023-06-19 10:58:31,342 INFO [train.py:996] (0/4) Epoch 2, batch 25050, loss[loss=0.2622, simple_loss=0.3085, pruned_loss=0.108, over 21597.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3359, pruned_loss=0.112, over 4283661.94 frames. ], batch size: 415, lr: 1.50e-02, grad_scale: 32.0
2023-06-19 10:58:40,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333270.0, ans=0.1
2023-06-19 10:59:09,116 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.713e+02 2.983e+02 3.396e+02 5.843e+02, threshold=5.966e+02, percent-clipped=0.0
2023-06-19 10:59:12,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=333330.0, ans=0.125
2023-06-19 10:59:29,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=333390.0, ans=0.125
2023-06-19 10:59:33,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=333450.0, ans=0.0
2023-06-19 11:00:03,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=333510.0, ans=0.125
2023-06-19 11:00:06,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=333510.0, ans=0.125
2023-06-19 11:00:30,862 INFO [train.py:996] (0/4) Epoch 2, batch 25100, loss[loss=0.2531, simple_loss=0.3051, pruned_loss=0.1006, over 21333.00 frames.
], tot_loss[loss=0.2744, simple_loss=0.3302, pruned_loss=0.1093, over 4275647.59 frames. ], batch size: 144, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:00:41,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=333570.0, ans=0.0 2023-06-19 11:01:14,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=333630.0, ans=0.125 2023-06-19 11:01:55,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=333750.0, ans=10.0 2023-06-19 11:02:36,048 INFO [train.py:996] (0/4) Epoch 2, batch 25150, loss[loss=0.275, simple_loss=0.3593, pruned_loss=0.09534, over 21361.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.334, pruned_loss=0.1064, over 4281839.45 frames. ], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:02:59,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=333870.0, ans=0.0 2023-06-19 11:03:12,801 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 2.528e+02 3.068e+02 3.908e+02 5.226e+02, threshold=6.135e+02, percent-clipped=0.0 2023-06-19 11:03:30,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-19 11:03:37,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=333990.0, ans=0.2 2023-06-19 11:03:51,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=334050.0, ans=0.1 2023-06-19 11:03:55,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.54 vs. limit=15.0 2023-06-19 11:04:47,130 INFO [train.py:996] (0/4) Epoch 2, batch 25200, loss[loss=0.2396, simple_loss=0.3267, pruned_loss=0.07628, over 21785.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3316, pruned_loss=0.1031, over 4282455.98 frames. ], batch size: 282, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:06:33,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=334410.0, ans=0.0 2023-06-19 11:06:47,739 INFO [train.py:996] (0/4) Epoch 2, batch 25250, loss[loss=0.3078, simple_loss=0.3445, pruned_loss=0.1356, over 21282.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3278, pruned_loss=0.1008, over 4251444.23 frames. ], batch size: 471, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:07:20,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.794e+02 3.264e+02 4.171e+02 6.058e+02, threshold=6.527e+02, percent-clipped=0.0 2023-06-19 11:07:36,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=334530.0, ans=0.035 2023-06-19 11:07:54,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=334590.0, ans=0.125 2023-06-19 11:07:55,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. 
limit=15.0 2023-06-19 11:08:06,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=334650.0, ans=0.125 2023-06-19 11:08:11,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=334650.0, ans=0.1 2023-06-19 11:08:14,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=334650.0, ans=0.125 2023-06-19 11:08:27,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-19 11:09:01,261 INFO [train.py:996] (0/4) Epoch 2, batch 25300, loss[loss=0.2938, simple_loss=0.33, pruned_loss=0.1288, over 20128.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3261, pruned_loss=0.1005, over 4261460.76 frames. ], batch size: 703, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:10:20,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.09 vs. limit=15.0 2023-06-19 11:10:23,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=334950.0, ans=0.2 2023-06-19 11:10:51,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=335010.0, ans=0.2 2023-06-19 11:11:15,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. limit=15.0 2023-06-19 11:11:16,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=335010.0, ans=0.125 2023-06-19 11:11:20,604 INFO [train.py:996] (0/4) Epoch 2, batch 25350, loss[loss=0.2336, simple_loss=0.3081, pruned_loss=0.07948, over 21789.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.329, pruned_loss=0.0997, over 4264381.67 frames. ], batch size: 317, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:11:44,833 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-06-19 11:11:52,963 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.646e+02 3.240e+02 3.926e+02 6.087e+02, threshold=6.480e+02, percent-clipped=0.0 2023-06-19 11:13:16,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=19.51 vs. limit=15.0 2023-06-19 11:13:19,496 INFO [train.py:996] (0/4) Epoch 2, batch 25400, loss[loss=0.2773, simple_loss=0.3262, pruned_loss=0.1142, over 21774.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3242, pruned_loss=0.09854, over 4257185.63 frames. ], batch size: 371, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:13:40,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=335430.0, ans=0.125 2023-06-19 11:14:30,907 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:14:32,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. 
limit=15.0 2023-06-19 11:15:31,563 INFO [train.py:996] (0/4) Epoch 2, batch 25450, loss[loss=0.2874, simple_loss=0.3579, pruned_loss=0.1084, over 21818.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3252, pruned_loss=0.1007, over 4264355.13 frames. ], batch size: 414, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:16:04,829 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.784e+02 3.356e+02 3.897e+02 6.428e+02, threshold=6.713e+02, percent-clipped=0.0 2023-06-19 11:16:10,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=335730.0, ans=0.125 2023-06-19 11:16:21,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=335790.0, ans=0.1 2023-06-19 11:17:21,493 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.44 vs. limit=15.0 2023-06-19 11:17:52,955 INFO [train.py:996] (0/4) Epoch 2, batch 25500, loss[loss=0.2965, simple_loss=0.3585, pruned_loss=0.1172, over 21350.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3249, pruned_loss=0.09719, over 4260645.20 frames. ], batch size: 159, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 11:17:59,336 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-56000.pt 2023-06-19 11:18:18,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=336030.0, ans=0.125 2023-06-19 11:19:30,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=336150.0, ans=0.125 2023-06-19 11:20:00,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=336210.0, ans=0.0 2023-06-19 11:20:10,338 INFO [train.py:996] (0/4) Epoch 2, batch 25550, loss[loss=0.2254, simple_loss=0.3092, pruned_loss=0.07078, over 21452.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3317, pruned_loss=0.09799, over 4265406.73 frames. ], batch size: 194, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:20:39,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=336330.0, ans=0.0 2023-06-19 11:20:46,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.372e+02 2.803e+02 3.273e+02 5.075e+02, threshold=5.607e+02, percent-clipped=0.0 2023-06-19 11:21:53,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=336450.0, ans=0.1 2023-06-19 11:21:54,137 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=22.5 2023-06-19 11:22:43,076 INFO [train.py:996] (0/4) Epoch 2, batch 25600, loss[loss=0.2456, simple_loss=0.35, pruned_loss=0.07063, over 19849.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3377, pruned_loss=0.09973, over 4266540.03 frames. 
], batch size: 702, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:23:39,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=336690.0, ans=0.2 2023-06-19 11:23:48,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=336750.0, ans=0.0 2023-06-19 11:24:25,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=336810.0, ans=0.125 2023-06-19 11:24:46,824 INFO [train.py:996] (0/4) Epoch 2, batch 25650, loss[loss=0.259, simple_loss=0.3139, pruned_loss=0.102, over 21874.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3384, pruned_loss=0.1024, over 4262462.53 frames. ], batch size: 98, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:24:47,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=336870.0, ans=0.2 2023-06-19 11:25:13,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.891e+02 3.339e+02 4.051e+02 5.669e+02, threshold=6.678e+02, percent-clipped=1.0 2023-06-19 11:25:15,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=336930.0, ans=0.025 2023-06-19 11:25:24,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-19 11:25:25,776 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:26:48,353 INFO [train.py:996] (0/4) Epoch 2, batch 25700, loss[loss=0.2498, simple_loss=0.3196, pruned_loss=0.08998, over 21616.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3356, pruned_loss=0.1034, over 4269559.63 frames. ], batch size: 230, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:27:53,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2023-06-19 11:28:20,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=15.0 2023-06-19 11:28:23,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=337350.0, ans=0.0 2023-06-19 11:29:14,401 INFO [train.py:996] (0/4) Epoch 2, batch 25750, loss[loss=0.2999, simple_loss=0.3609, pruned_loss=0.1194, over 21611.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3412, pruned_loss=0.1075, over 4268224.86 frames. 
], batch size: 230, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:30:07,488 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.884e+02 3.379e+02 4.155e+02 6.943e+02, threshold=6.757e+02, percent-clipped=1.0 2023-06-19 11:31:27,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=337710.0, ans=0.0 2023-06-19 11:31:34,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=337710.0, ans=0.0 2023-06-19 11:31:34,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337710.0, ans=0.1 2023-06-19 11:31:43,381 INFO [train.py:996] (0/4) Epoch 2, batch 25800, loss[loss=0.2932, simple_loss=0.3603, pruned_loss=0.113, over 21304.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3548, pruned_loss=0.1132, over 4269113.53 frames. ], batch size: 548, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:32:21,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=337770.0, ans=0.0 2023-06-19 11:32:33,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=337830.0, ans=0.125 2023-06-19 11:34:19,980 INFO [train.py:996] (0/4) Epoch 2, batch 25850, loss[loss=0.2632, simple_loss=0.3204, pruned_loss=0.103, over 21508.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3552, pruned_loss=0.1128, over 4268661.31 frames. ], batch size: 131, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:34:39,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=338070.0, ans=0.125 2023-06-19 11:35:07,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.249e+02 2.822e+02 3.088e+02 3.805e+02 7.433e+02, threshold=6.176e+02, percent-clipped=1.0 2023-06-19 11:35:18,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=338130.0, ans=0.2 2023-06-19 11:35:35,957 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:35:45,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-19 11:36:04,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=338310.0, ans=0.0 2023-06-19 11:36:45,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=338370.0, ans=0.125 2023-06-19 11:36:46,149 INFO [train.py:996] (0/4) Epoch 2, batch 25900, loss[loss=0.3512, simple_loss=0.4269, pruned_loss=0.1378, over 21717.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3558, pruned_loss=0.1135, over 4268852.82 frames. 
], batch size: 389, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:37:52,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=338490.0, ans=0.125 2023-06-19 11:38:14,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=338550.0, ans=0.125 2023-06-19 11:38:23,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=338550.0, ans=0.125 2023-06-19 11:38:42,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=338610.0, ans=0.0 2023-06-19 11:39:03,313 INFO [train.py:996] (0/4) Epoch 2, batch 25950, loss[loss=0.2984, simple_loss=0.3608, pruned_loss=0.118, over 21622.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3627, pruned_loss=0.1167, over 4268868.65 frames. ], batch size: 263, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:39:36,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.799e+02 3.323e+02 3.883e+02 7.298e+02, threshold=6.645e+02, percent-clipped=1.0 2023-06-19 11:39:36,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=338730.0, ans=0.125 2023-06-19 11:39:38,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=338730.0, ans=0.125 2023-06-19 11:40:04,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-19 11:40:18,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=338790.0, ans=0.1 2023-06-19 11:41:30,341 INFO [train.py:996] (0/4) Epoch 2, batch 26000, loss[loss=0.3223, simple_loss=0.3905, pruned_loss=0.1271, over 21681.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3615, pruned_loss=0.1149, over 4271205.21 frames. ], batch size: 351, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:41:45,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=338970.0, ans=0.125 2023-06-19 11:41:47,422 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2023-06-19 11:41:59,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=339030.0, ans=0.125 2023-06-19 11:42:48,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.99 vs. limit=15.0 2023-06-19 11:42:48,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.12 vs. limit=15.0 2023-06-19 11:43:33,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=339210.0, ans=0.025 2023-06-19 11:43:47,050 INFO [train.py:996] (0/4) Epoch 2, batch 26050, loss[loss=0.2715, simple_loss=0.3334, pruned_loss=0.1048, over 21907.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.362, pruned_loss=0.1164, over 4272615.75 frames. 
], batch size: 113, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:44:17,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.721e+02 3.311e+02 3.972e+02 6.601e+02, threshold=6.622e+02, percent-clipped=0.0 2023-06-19 11:45:44,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=339510.0, ans=0.125 2023-06-19 11:46:04,718 INFO [train.py:996] (0/4) Epoch 2, batch 26100, loss[loss=0.2937, simple_loss=0.3351, pruned_loss=0.1261, over 21726.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3557, pruned_loss=0.1149, over 4278317.28 frames. ], batch size: 473, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:46:21,180 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-19 11:47:52,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=339750.0, ans=0.0 2023-06-19 11:48:33,526 INFO [train.py:996] (0/4) Epoch 2, batch 26150, loss[loss=0.2744, simple_loss=0.3277, pruned_loss=0.1105, over 21794.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3525, pruned_loss=0.1141, over 4282626.67 frames. ], batch size: 282, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:49:14,425 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 2.871e+02 3.280e+02 4.154e+02 8.858e+02, threshold=6.561e+02, percent-clipped=1.0 2023-06-19 11:50:09,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-19 11:50:47,962 INFO [train.py:996] (0/4) Epoch 2, batch 26200, loss[loss=0.2605, simple_loss=0.3485, pruned_loss=0.08622, over 21227.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3519, pruned_loss=0.111, over 4281102.27 frames. ], batch size: 176, lr: 1.49e-02, grad_scale: 64.0 2023-06-19 11:52:31,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=340350.0, ans=0.025 2023-06-19 11:52:48,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-19 11:52:55,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=340350.0, ans=0.0 2023-06-19 11:53:11,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=340410.0, ans=0.125 2023-06-19 11:53:25,773 INFO [train.py:996] (0/4) Epoch 2, batch 26250, loss[loss=0.2966, simple_loss=0.3577, pruned_loss=0.1178, over 21882.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3556, pruned_loss=0.1101, over 4276391.73 frames. 
], batch size: 414, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:53:38,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=340470.0, ans=0.125 2023-06-19 11:54:03,520 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.625e+02 2.924e+02 3.538e+02 7.396e+02, threshold=5.848e+02, percent-clipped=2.0 2023-06-19 11:54:12,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=340590.0, ans=0.125 2023-06-19 11:55:11,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=340650.0, ans=0.125 2023-06-19 11:55:25,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340710.0, ans=0.1 2023-06-19 11:55:28,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340710.0, ans=0.1 2023-06-19 11:55:30,358 INFO [train.py:996] (0/4) Epoch 2, batch 26300, loss[loss=0.244, simple_loss=0.2922, pruned_loss=0.0979, over 21204.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3533, pruned_loss=0.111, over 4281596.01 frames. ], batch size: 608, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 11:57:17,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.92 vs. limit=8.0 2023-06-19 11:58:04,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=341010.0, ans=0.125 2023-06-19 11:58:06,959 INFO [train.py:996] (0/4) Epoch 2, batch 26350, loss[loss=0.3065, simple_loss=0.3652, pruned_loss=0.1239, over 21682.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3516, pruned_loss=0.1115, over 4283840.39 frames. ], batch size: 351, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 11:58:26,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.24 vs. limit=10.0 2023-06-19 11:58:55,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-19 11:58:57,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.087e+02 3.544e+02 4.218e+02 8.701e+02, threshold=7.087e+02, percent-clipped=9.0 2023-06-19 11:59:36,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=341250.0, ans=0.0 2023-06-19 11:59:38,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=341250.0, ans=0.1 2023-06-19 11:59:49,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=341310.0, ans=0.1 2023-06-19 12:00:00,725 INFO [train.py:996] (0/4) Epoch 2, batch 26400, loss[loss=0.2625, simple_loss=0.3085, pruned_loss=0.1083, over 21827.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3445, pruned_loss=0.111, over 4284028.99 frames. ], batch size: 102, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:01:00,386 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.67 vs. 
limit=22.5 2023-06-19 12:01:05,048 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.33 vs. limit=15.0 2023-06-19 12:02:03,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=341610.0, ans=0.125 2023-06-19 12:02:25,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=341610.0, ans=0.1 2023-06-19 12:02:30,915 INFO [train.py:996] (0/4) Epoch 2, batch 26450, loss[loss=0.3089, simple_loss=0.3998, pruned_loss=0.1091, over 21725.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3462, pruned_loss=0.1108, over 4284425.48 frames. ], batch size: 351, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:03:21,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.970e+02 3.486e+02 4.809e+02 7.983e+02, threshold=6.973e+02, percent-clipped=4.0 2023-06-19 12:03:33,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=341790.0, ans=0.125 2023-06-19 12:03:42,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=341790.0, ans=0.0 2023-06-19 12:03:43,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=341790.0, ans=0.1 2023-06-19 12:04:34,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=341910.0, ans=0.125 2023-06-19 12:05:00,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=341910.0, ans=0.035 2023-06-19 12:05:01,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=341910.0, ans=0.1 2023-06-19 12:05:04,482 INFO [train.py:996] (0/4) Epoch 2, batch 26500, loss[loss=0.2397, simple_loss=0.3169, pruned_loss=0.08123, over 21568.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3467, pruned_loss=0.1085, over 4273058.45 frames. ], batch size: 263, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:05:35,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=342030.0, ans=0.2 2023-06-19 12:05:42,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=342030.0, ans=0.035 2023-06-19 12:06:43,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=342150.0, ans=0.125 2023-06-19 12:07:39,704 INFO [train.py:996] (0/4) Epoch 2, batch 26550, loss[loss=0.28, simple_loss=0.3622, pruned_loss=0.09893, over 21477.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3418, pruned_loss=0.1052, over 4275780.38 frames. 
], batch size: 471, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:08:27,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.765e+02 3.396e+02 4.650e+02 6.569e+02, threshold=6.792e+02, percent-clipped=0.0 2023-06-19 12:08:27,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=342330.0, ans=0.125 2023-06-19 12:09:19,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=342450.0, ans=0.125 2023-06-19 12:09:31,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=342510.0, ans=0.1 2023-06-19 12:09:50,672 INFO [train.py:996] (0/4) Epoch 2, batch 26600, loss[loss=0.2579, simple_loss=0.3287, pruned_loss=0.09356, over 21755.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.341, pruned_loss=0.1022, over 4267963.99 frames. ], batch size: 351, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:10:19,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=342570.0, ans=0.0 2023-06-19 12:10:29,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=342630.0, ans=0.1 2023-06-19 12:11:21,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=342750.0, ans=0.2 2023-06-19 12:11:39,399 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-19 12:12:01,207 INFO [train.py:996] (0/4) Epoch 2, batch 26650, loss[loss=0.1981, simple_loss=0.2836, pruned_loss=0.05626, over 21853.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3339, pruned_loss=0.101, over 4263108.61 frames. ], batch size: 352, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:12:06,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-19 12:12:30,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=342930.0, ans=0.0 2023-06-19 12:12:49,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 2.521e+02 2.956e+02 3.686e+02 6.672e+02, threshold=5.912e+02, percent-clipped=0.0 2023-06-19 12:14:24,018 INFO [train.py:996] (0/4) Epoch 2, batch 26700, loss[loss=0.3301, simple_loss=0.3719, pruned_loss=0.1442, over 21831.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3255, pruned_loss=0.09685, over 4272393.37 frames. ], batch size: 441, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:14:59,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=343230.0, ans=0.2 2023-06-19 12:15:54,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=343350.0, ans=0.125 2023-06-19 12:15:56,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. 
limit=22.5 2023-06-19 12:16:03,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=343350.0, ans=0.0 2023-06-19 12:16:20,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=343410.0, ans=0.125 2023-06-19 12:16:23,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=343410.0, ans=0.1 2023-06-19 12:16:37,960 INFO [train.py:996] (0/4) Epoch 2, batch 26750, loss[loss=0.2132, simple_loss=0.2905, pruned_loss=0.06798, over 21633.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3259, pruned_loss=0.09657, over 4275766.57 frames. ], batch size: 230, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 12:17:33,071 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 2.465e+02 2.855e+02 3.693e+02 7.404e+02, threshold=5.711e+02, percent-clipped=5.0 2023-06-19 12:17:41,990 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:19:17,208 INFO [train.py:996] (0/4) Epoch 2, batch 26800, loss[loss=0.3378, simple_loss=0.4098, pruned_loss=0.1329, over 21433.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3369, pruned_loss=0.1036, over 4276617.09 frames. ], batch size: 131, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:19:20,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=343770.0, ans=0.2 2023-06-19 12:19:24,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-19 12:19:44,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=343830.0, ans=0.2 2023-06-19 12:20:01,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=343890.0, ans=0.0 2023-06-19 12:20:21,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=343890.0, ans=0.07 2023-06-19 12:20:47,670 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:21:26,712 INFO [train.py:996] (0/4) Epoch 2, batch 26850, loss[loss=0.3033, simple_loss=0.3404, pruned_loss=0.1331, over 21558.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.339, pruned_loss=0.1073, over 4279781.92 frames. ], batch size: 441, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:22:01,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.924e+02 3.551e+02 4.179e+02 8.286e+02, threshold=7.103e+02, percent-clipped=6.0 2023-06-19 12:22:01,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=344130.0, ans=0.0 2023-06-19 12:22:59,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.07 vs. 
limit=15.0 2023-06-19 12:23:00,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=344310.0, ans=0.0 2023-06-19 12:23:19,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=344310.0, ans=0.125 2023-06-19 12:23:20,286 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.07 vs. limit=22.5 2023-06-19 12:23:24,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=344310.0, ans=0.2 2023-06-19 12:23:26,398 INFO [train.py:996] (0/4) Epoch 2, batch 26900, loss[loss=0.2496, simple_loss=0.2959, pruned_loss=0.1016, over 21579.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3303, pruned_loss=0.1055, over 4272343.42 frames. ], batch size: 298, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:24:09,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.60 vs. limit=15.0 2023-06-19 12:25:05,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=344610.0, ans=0.0 2023-06-19 12:25:37,856 INFO [train.py:996] (0/4) Epoch 2, batch 26950, loss[loss=0.287, simple_loss=0.3573, pruned_loss=0.1083, over 21731.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.33, pruned_loss=0.1054, over 4269535.80 frames. ], batch size: 351, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:25:43,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=344670.0, ans=0.1 2023-06-19 12:26:25,925 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.623e+02 3.082e+02 3.619e+02 5.950e+02, threshold=6.165e+02, percent-clipped=0.0 2023-06-19 12:26:47,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=344790.0, ans=0.125 2023-06-19 12:27:02,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=344850.0, ans=0.0 2023-06-19 12:27:10,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=344850.0, ans=15.0 2023-06-19 12:27:33,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=344910.0, ans=0.0 2023-06-19 12:27:59,446 INFO [train.py:996] (0/4) Epoch 2, batch 27000, loss[loss=0.2733, simple_loss=0.3516, pruned_loss=0.09752, over 21539.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3314, pruned_loss=0.1033, over 4273836.88 frames. ], batch size: 441, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:27:59,447 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 12:28:53,150 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.9918, 2.0554, 3.3033, 2.1030], device='cuda:0') 2023-06-19 12:28:58,708 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2596, simple_loss=0.3558, pruned_loss=0.08164, over 1796401.00 frames. 
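The records just above show the periodic validation pass: train.py pauses training, runs the dev set through the model (also emitting diagnostics such as the attention-weights entropy tensor from zipformer.py), and reports a frame-weighted validation loss, here loss=0.2596 over 1796401 frames; the record that follows then logs peak GPU memory before training resumes. A minimal sketch of such a frame-weighted validation loop, assuming a model that returns a summed loss and a frame count per batch (the names and interface are illustrative, not icefall's actual code):

import torch

@torch.no_grad()
def compute_validation_loss(model, dev_loader):
    # Accumulate loss weighted by the number of acoustic frames, so the
    # result matches the "loss=..., over N frames" style of the records above.
    model.eval()
    tot_loss, tot_frames = 0.0, 0
    for batch in dev_loader:
        loss, num_frames = model(batch)  # assumed interface: (summed loss, frames)
        tot_loss += loss.item()
        tot_frames += num_frames
    model.train()
    return tot_loss / max(tot_frames, 1)
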
2023-06-19 12:28:58,717 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-19 12:29:18,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=344970.0, ans=0.0 2023-06-19 12:29:34,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=345030.0, ans=0.125 2023-06-19 12:30:27,077 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.56 vs. limit=22.5 2023-06-19 12:30:35,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=345210.0, ans=0.125 2023-06-19 12:30:48,514 INFO [train.py:996] (0/4) Epoch 2, batch 27050, loss[loss=0.2294, simple_loss=0.3215, pruned_loss=0.06865, over 21463.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3315, pruned_loss=0.09847, over 4273335.76 frames. ], batch size: 211, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:31:23,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.340e+02 2.656e+02 3.215e+02 6.128e+02, threshold=5.313e+02, percent-clipped=0.0 2023-06-19 12:31:24,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-19 12:31:53,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=345390.0, ans=0.1 2023-06-19 12:33:09,922 INFO [train.py:996] (0/4) Epoch 2, batch 27100, loss[loss=0.2734, simple_loss=0.3497, pruned_loss=0.09851, over 21204.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3346, pruned_loss=0.1005, over 4270937.06 frames. ], batch size: 143, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 12:34:41,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=345750.0, ans=0.0 2023-06-19 12:34:43,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=345750.0, ans=0.0 2023-06-19 12:34:46,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=345810.0, ans=0.1 2023-06-19 12:35:13,362 INFO [train.py:996] (0/4) Epoch 2, batch 27150, loss[loss=0.3567, simple_loss=0.4325, pruned_loss=0.1405, over 21675.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3457, pruned_loss=0.1033, over 4280055.86 frames. 
], batch size: 414, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:35:28,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=345930.0, ans=0.2 2023-06-19 12:35:36,686 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.859e+02 3.282e+02 3.762e+02 6.108e+02, threshold=6.564e+02, percent-clipped=5.0 2023-06-19 12:36:00,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=345990.0, ans=0.0 2023-06-19 12:36:04,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=345990.0, ans=0.0 2023-06-19 12:36:14,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=345990.0, ans=0.0 2023-06-19 12:36:56,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=346110.0, ans=0.125 2023-06-19 12:37:10,440 INFO [train.py:996] (0/4) Epoch 2, batch 27200, loss[loss=0.3816, simple_loss=0.423, pruned_loss=0.1701, over 21443.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3555, pruned_loss=0.1077, over 4279825.11 frames. ], batch size: 471, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:37:14,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=346170.0, ans=0.1 2023-06-19 12:38:02,596 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-19 12:38:55,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-19 12:38:55,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-19 12:39:07,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=346410.0, ans=0.1 2023-06-19 12:39:10,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=346410.0, ans=0.0 2023-06-19 12:39:11,466 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:39:12,436 INFO [train.py:996] (0/4) Epoch 2, batch 27250, loss[loss=0.3044, simple_loss=0.3662, pruned_loss=0.1213, over 21586.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3584, pruned_loss=0.1128, over 4275439.97 frames. 
], batch size: 389, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:39:30,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=346470.0, ans=0.125 2023-06-19 12:39:36,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=346470.0, ans=0.125 2023-06-19 12:40:02,684 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.829e+02 3.133e+02 3.689e+02 6.969e+02, threshold=6.265e+02, percent-clipped=1.0 2023-06-19 12:40:52,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=346650.0, ans=0.1 2023-06-19 12:40:57,377 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=15.0 2023-06-19 12:41:38,441 INFO [train.py:996] (0/4) Epoch 2, batch 27300, loss[loss=0.2784, simple_loss=0.3604, pruned_loss=0.09819, over 21689.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3596, pruned_loss=0.1138, over 4273898.39 frames. ], batch size: 351, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:41:39,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=346770.0, ans=0.125 2023-06-19 12:42:22,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=346830.0, ans=0.1 2023-06-19 12:42:40,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-19 12:43:00,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=346950.0, ans=0.0 2023-06-19 12:43:35,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=347010.0, ans=0.125 2023-06-19 12:44:02,165 INFO [train.py:996] (0/4) Epoch 2, batch 27350, loss[loss=0.278, simple_loss=0.344, pruned_loss=0.106, over 21890.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3614, pruned_loss=0.1144, over 4277965.96 frames. ], batch size: 371, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:44:52,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=22.5 2023-06-19 12:44:53,032 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.548e+02 2.973e+02 3.425e+02 6.706e+02, threshold=5.945e+02, percent-clipped=1.0 2023-06-19 12:45:11,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=347190.0, ans=0.125 2023-06-19 12:45:37,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-19 12:46:09,206 INFO [train.py:996] (0/4) Epoch 2, batch 27400, loss[loss=0.2674, simple_loss=0.3219, pruned_loss=0.1064, over 21532.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3571, pruned_loss=0.1136, over 4275862.10 frames. 
], batch size: 389, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:46:30,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=347370.0, ans=0.0 2023-06-19 12:47:19,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=347550.0, ans=0.95 2023-06-19 12:47:25,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=347550.0, ans=0.125 2023-06-19 12:48:05,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=347610.0, ans=0.1 2023-06-19 12:48:07,711 INFO [train.py:996] (0/4) Epoch 2, batch 27450, loss[loss=0.2984, simple_loss=0.3667, pruned_loss=0.115, over 21426.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3488, pruned_loss=0.1105, over 4278206.60 frames. ], batch size: 471, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:49:01,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.006e+02 3.444e+02 4.102e+02 6.886e+02, threshold=6.888e+02, percent-clipped=2.0 2023-06-19 12:49:13,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=347790.0, ans=0.2 2023-06-19 12:50:24,222 INFO [train.py:996] (0/4) Epoch 2, batch 27500, loss[loss=0.297, simple_loss=0.3504, pruned_loss=0.1218, over 21857.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3474, pruned_loss=0.1109, over 4280481.86 frames. ], batch size: 414, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:50:37,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=347970.0, ans=0.125 2023-06-19 12:50:55,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=348030.0, ans=0.09899494936611666 2023-06-19 12:51:04,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=348030.0, ans=0.125 2023-06-19 12:52:29,467 INFO [train.py:996] (0/4) Epoch 2, batch 27550, loss[loss=0.2174, simple_loss=0.2958, pruned_loss=0.06951, over 21560.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3403, pruned_loss=0.106, over 4284118.68 frames. ], batch size: 230, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:53:08,728 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.657e+02 3.197e+02 3.865e+02 5.896e+02, threshold=6.395e+02, percent-clipped=0.0 2023-06-19 12:53:45,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=348450.0, ans=0.0 2023-06-19 12:54:11,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=348510.0, ans=0.125 2023-06-19 12:54:36,376 INFO [train.py:996] (0/4) Epoch 2, batch 27600, loss[loss=0.258, simple_loss=0.3218, pruned_loss=0.09715, over 21337.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3344, pruned_loss=0.1053, over 4279014.66 frames. 
], batch size: 194, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:55:05,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=348630.0, ans=0.125 2023-06-19 12:55:22,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=348690.0, ans=0.0 2023-06-19 12:55:34,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=348750.0, ans=0.0 2023-06-19 12:55:53,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=348810.0, ans=0.0 2023-06-19 12:56:19,785 INFO [train.py:996] (0/4) Epoch 2, batch 27650, loss[loss=0.3081, simple_loss=0.3697, pruned_loss=0.1233, over 21611.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3285, pruned_loss=0.1042, over 4276700.19 frames. ], batch size: 389, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:56:27,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=348870.0, ans=0.04949747468305833 2023-06-19 12:56:47,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=348930.0, ans=0.125 2023-06-19 12:56:49,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=348930.0, ans=0.125 2023-06-19 12:57:01,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.710e+02 3.133e+02 3.895e+02 7.823e+02, threshold=6.265e+02, percent-clipped=1.0 2023-06-19 12:57:23,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=349050.0, ans=0.0 2023-06-19 12:58:17,478 INFO [train.py:996] (0/4) Epoch 2, batch 27700, loss[loss=0.2176, simple_loss=0.2834, pruned_loss=0.0759, over 16500.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3279, pruned_loss=0.1017, over 4264077.72 frames. ], batch size: 61, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 12:58:39,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.64 vs. limit=10.0 2023-06-19 12:59:10,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=349230.0, ans=0.2 2023-06-19 12:59:23,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=349290.0, ans=0.0 2023-06-19 13:00:15,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=349410.0, ans=0.2 2023-06-19 13:00:30,247 INFO [train.py:996] (0/4) Epoch 2, batch 27750, loss[loss=0.2439, simple_loss=0.3306, pruned_loss=0.07859, over 21831.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3309, pruned_loss=0.1021, over 4267863.70 frames. 
], batch size: 371, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 13:00:42,249 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:01:11,776 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:01:12,828 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.697e+02 3.256e+02 3.829e+02 6.521e+02, threshold=6.511e+02, percent-clipped=1.0 2023-06-19 13:01:33,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=349590.0, ans=0.125 2023-06-19 13:01:44,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=349650.0, ans=0.125 2023-06-19 13:02:11,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.15 vs. limit=15.0 2023-06-19 13:02:40,129 INFO [train.py:996] (0/4) Epoch 2, batch 27800, loss[loss=0.2634, simple_loss=0.292, pruned_loss=0.1173, over 20257.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3287, pruned_loss=0.1017, over 4272190.39 frames. ], batch size: 703, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 13:03:14,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=349830.0, ans=0.0 2023-06-19 13:03:28,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-19 13:03:34,181 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-19 13:04:44,755 INFO [train.py:996] (0/4) Epoch 2, batch 27850, loss[loss=0.2654, simple_loss=0.3441, pruned_loss=0.09341, over 21770.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.33, pruned_loss=0.1045, over 4280089.52 frames. ], batch size: 247, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 13:05:41,174 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.837e+02 3.296e+02 3.956e+02 7.636e+02, threshold=6.591e+02, percent-clipped=1.0 2023-06-19 13:06:29,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-19 13:06:42,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=350310.0, ans=0.125 2023-06-19 13:06:54,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=350310.0, ans=0.125 2023-06-19 13:07:15,654 INFO [train.py:996] (0/4) Epoch 2, batch 27900, loss[loss=0.2666, simple_loss=0.3528, pruned_loss=0.09021, over 21608.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3387, pruned_loss=0.1053, over 4280315.51 frames. 
2023-06-19 13:08:12,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=350490.0, ans=0.0 2023-06-19 13:09:04,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=350610.0, ans=0.125 2023-06-19 13:09:13,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350610.0, ans=0.1 2023-06-19 13:09:20,639 INFO [train.py:996] (0/4) Epoch 2, batch 27950, loss[loss=0.2148, simple_loss=0.3012, pruned_loss=0.06421, over 21575.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3403, pruned_loss=0.1019, over 4280750.22 frames. ], batch size: 230, lr: 1.46e-02, grad_scale: 16.0 2023-06-19 13:09:28,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=350670.0, ans=0.0 2023-06-19 13:09:57,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.612e+02 3.167e+02 4.012e+02 7.863e+02, threshold=6.333e+02, percent-clipped=3.0 2023-06-19 13:09:58,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=350730.0, ans=0.0 2023-06-19 13:10:40,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=350790.0, ans=0.0 2023-06-19 13:11:28,374 INFO [train.py:996] (0/4) Epoch 2, batch 28000, loss[loss=0.2682, simple_loss=0.3321, pruned_loss=0.1021, over 21844.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3368, pruned_loss=0.09882, over 4287134.66 frames. ], batch size: 414, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:11:29,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=350970.0, ans=0.2 2023-06-19 13:11:42,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=351030.0, ans=0.125 2023-06-19 13:11:44,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=351030.0, ans=0.125 2023-06-19 13:12:16,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=351090.0, ans=0.125 2023-06-19 13:12:53,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=351150.0, ans=0.125 2023-06-19 13:13:30,524 INFO [train.py:996] (0/4) Epoch 2, batch 28050, loss[loss=0.2574, simple_loss=0.2914, pruned_loss=0.1117, over 20263.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3335, pruned_loss=0.1002, over 4284774.58 frames. ], batch size: 703, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:13:33,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=351270.0, ans=0.09899494936611666 2023-06-19 13:13:53,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=351330.0, ans=0.2 2023-06-19 13:14:05,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0
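The grad_scale column behaves like dynamic fp16 loss scaling: it halved from 32.0 to 16.0 around batch 27800 and grew back to 32.0 by batch 28000 once steps stayed finite (it reaches 64.0 further down). A sketch using the stock torch.cuda.amp API; whether this trainer uses torch's GradScaler or its own variant is an assumption, and the forward pass is a placeholder:

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

def train_step(model, optimizer, features):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(features).mean()  # placeholder forward/loss
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # skipped internally if grads overflowed
    scaler.update()                    # halve scale on overflow, grow it later
    return loss.detach()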
2023-06-19 13:14:33,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.018e+02 3.592e+02 4.421e+02 7.489e+02, threshold=7.184e+02, percent-clipped=8.0 2023-06-19 13:15:07,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=351450.0, ans=0.125 2023-06-19 13:15:39,816 INFO [train.py:996] (0/4) Epoch 2, batch 28100, loss[loss=0.2503, simple_loss=0.3053, pruned_loss=0.09763, over 21779.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.329, pruned_loss=0.09976, over 4282101.95 frames. ], batch size: 118, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:16:57,194 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-19 13:17:18,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=351810.0, ans=0.0 2023-06-19 13:17:32,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=351870.0, ans=0.95 2023-06-19 13:17:35,509 INFO [train.py:996] (0/4) Epoch 2, batch 28150, loss[loss=0.2512, simple_loss=0.2988, pruned_loss=0.1018, over 21615.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3221, pruned_loss=0.09951, over 4282918.70 frames. ], batch size: 298, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:17:35,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=351870.0, ans=0.125 2023-06-19 13:18:30,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.813e+02 3.244e+02 3.974e+02 6.604e+02, threshold=6.487e+02, percent-clipped=0.0 2023-06-19 13:19:16,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=352050.0, ans=0.125 2023-06-19 13:19:31,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=352110.0, ans=0.125 2023-06-19 13:19:37,283 INFO [train.py:996] (0/4) Epoch 2, batch 28200, loss[loss=0.2611, simple_loss=0.303, pruned_loss=0.1095, over 20684.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.322, pruned_loss=0.1021, over 4277766.99 frames. ], batch size: 607, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:20:54,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=352290.0, ans=0.125 2023-06-19 13:20:56,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=352290.0, ans=0.0 2023-06-19 13:21:32,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=352410.0, ans=0.125 2023-06-19 13:21:43,616 INFO [train.py:996] (0/4) Epoch 2, batch 28250, loss[loss=0.2697, simple_loss=0.3217, pruned_loss=0.1088, over 21196.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3297, pruned_loss=0.1069, over 4277956.44 frames. ], batch size: 176, lr: 1.46e-02, grad_scale: 32.0
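ScheduledFloat entries print the current value (ans=...) of hyperparameters that are scheduled on batch_count; at batch_count ≈ 3.5e5 they sit at what appear to be their final values (probs at 0.125, most skip rates at 0.0). A sketch assuming piecewise-linear interpolation between (batch_count, value) breakpoints, clamped at both ends; the breakpoints below are illustrative, not taken from scaling.py:

from bisect import bisect_right

class PiecewiseLinearFloat:
    # A value that ramps linearly between (batch_count, value) breakpoints.
    def __init__(self, *points):
        self.xs = [x for x, _ in points]  # breakpoint batch counts, sorted
        self.ys = [y for _, y in points]  # values at those breakpoints

    def value(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect_right(self.xs, batch_count) - 1
        t = (batch_count - self.xs[i]) / (self.xs[i + 1] - self.xs[i])
        return self.ys[i] + t * (self.ys[i + 1] - self.ys[i])

# Illustrative schedule: a balancer prob ramping 0.5 -> 0.125, long finished
# at the batch counts logged here.
prob = PiecewiseLinearFloat((0.0, 0.5), (8000.0, 0.125))
assert prob.value(352410.0) == 0.125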
2023-06-19 13:22:03,982 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:22:30,579 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.279e+02 3.953e+02 4.755e+02 7.153e+02, threshold=7.906e+02, percent-clipped=2.0 2023-06-19 13:23:17,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=352650.0, ans=0.125 2023-06-19 13:23:53,456 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.06 vs. limit=10.0 2023-06-19 13:24:01,143 INFO [train.py:996] (0/4) Epoch 2, batch 28300, loss[loss=0.2354, simple_loss=0.3176, pruned_loss=0.07667, over 21626.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.326, pruned_loss=0.1031, over 4273644.47 frames. ], batch size: 389, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:24:27,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352830.0, ans=0.1 2023-06-19 13:25:33,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352950.0, ans=0.1 2023-06-19 13:26:04,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-19 13:26:11,216 INFO [train.py:996] (0/4) Epoch 2, batch 28350, loss[loss=0.2295, simple_loss=0.2878, pruned_loss=0.0856, over 21572.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3213, pruned_loss=0.0965, over 4262474.48 frames. ], batch size: 263, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:26:14,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=353070.0, ans=0.0 2023-06-19 13:26:45,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=353070.0, ans=0.125 2023-06-19 13:26:57,252 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-19 13:27:10,665 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 2.403e+02 3.006e+02 3.946e+02 7.827e+02, threshold=6.012e+02, percent-clipped=0.0 2023-06-19 13:27:24,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=353190.0, ans=0.125 2023-06-19 13:27:29,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353250.0, ans=0.1 2023-06-19 13:27:36,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-19 13:28:22,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353310.0, ans=0.1 2023-06-19 13:28:34,679 INFO [train.py:996] (0/4) Epoch 2, batch 28400, loss[loss=0.2905, simple_loss=0.3403, pruned_loss=0.1203, over 21223.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3183, pruned_loss=0.09737, over 4257599.48 frames.
], batch size: 159, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:28:55,712 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-19 13:29:52,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-19 13:30:38,153 INFO [train.py:996] (0/4) Epoch 2, batch 28450, loss[loss=0.2858, simple_loss=0.3405, pruned_loss=0.1155, over 21251.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3264, pruned_loss=0.1024, over 4263766.50 frames. ], batch size: 143, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:31:13,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-06-19 13:31:20,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.233e+02 3.030e+02 3.579e+02 4.363e+02 8.439e+02, threshold=7.159e+02, percent-clipped=7.0 2023-06-19 13:31:30,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-06-19 13:31:33,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=353790.0, ans=0.125 2023-06-19 13:32:01,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.92 vs. limit=22.5 2023-06-19 13:32:28,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=353910.0, ans=0.125 2023-06-19 13:32:47,139 INFO [train.py:996] (0/4) Epoch 2, batch 28500, loss[loss=0.3215, simple_loss=0.3753, pruned_loss=0.1338, over 21678.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.328, pruned_loss=0.1045, over 4271044.73 frames. ], batch size: 415, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:33:34,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5 2023-06-19 13:34:41,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354210.0, ans=0.1 2023-06-19 13:34:50,567 INFO [train.py:996] (0/4) Epoch 2, batch 28550, loss[loss=0.285, simple_loss=0.3761, pruned_loss=0.09698, over 21620.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3375, pruned_loss=0.1072, over 4281463.77 frames. ], batch size: 230, lr: 1.46e-02, grad_scale: 32.0
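Each Whitening line compares a per-module statistic of the activations against that module's limit (metric=9.31 vs. limit=15.0 and similar above). One plausible form for such a statistic, offered as an assumption rather than the actual scaling.py computation, is mean(eig^2) / mean(eig)^2 over the eigenvalues of the feature covariance: exactly 1.0 when the covariance is a multiple of the identity (fully white) and growing as the features become anisotropic:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # Anisotropy of features x with shape (num_frames, num_channels):
    # mean(eig^2) / mean(eig)^2 of the covariance; 1.0 means white.
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]                    # (C, C) covariance
    num_channels = cov.shape[0]
    mean_eig = torch.diagonal(cov).mean()           # trace / C
    mean_eig_sq = (cov * cov).sum() / num_channels  # trace(cov @ cov) / C
    return (mean_eig_sq / mean_eig ** 2).item()

x = torch.randn(10000, 256)
print(whitening_metric(x))                                  # close to 1.0
print(whitening_metric(x * torch.linspace(0.1, 3.0, 256)))  # noticeably larger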
2023-06-19 13:35:13,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=354330.0, ans=0.125 2023-06-19 13:35:13,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=354330.0, ans=0.125 2023-06-19 13:35:35,186 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.880e+02 3.273e+02 3.942e+02 6.177e+02, threshold=6.546e+02, percent-clipped=0.0 2023-06-19 13:36:47,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=354510.0, ans=0.1 2023-06-19 13:36:58,150 INFO [train.py:996] (0/4) Epoch 2, batch 28600, loss[loss=0.3284, simple_loss=0.3791, pruned_loss=0.1389, over 21789.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3438, pruned_loss=0.1094, over 4276862.96 frames. ], batch size: 441, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:37:12,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=354570.0, ans=0.0 2023-06-19 13:38:31,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=354750.0, ans=0.0 2023-06-19 13:38:47,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=354810.0, ans=0.0 2023-06-19 13:38:58,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354810.0, ans=0.1 2023-06-19 13:39:02,544 INFO [train.py:996] (0/4) Epoch 2, batch 28650, loss[loss=0.2692, simple_loss=0.3219, pruned_loss=0.1082, over 21691.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3379, pruned_loss=0.1087, over 4280099.84 frames. ], batch size: 333, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:39:08,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354870.0, ans=0.1 2023-06-19 13:39:44,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.911e+02 3.310e+02 3.742e+02 7.048e+02, threshold=6.621e+02, percent-clipped=2.0 2023-06-19 13:39:50,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=354930.0, ans=0.0 2023-06-19 13:41:06,830 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.28 vs. limit=22.5 2023-06-19 13:41:08,386 INFO [train.py:996] (0/4) Epoch 2, batch 28700, loss[loss=0.259, simple_loss=0.3226, pruned_loss=0.09774, over 21606.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3375, pruned_loss=0.1096, over 4282394.94 frames. ], batch size: 230, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 13:41:30,314 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.80 vs.
limit=15.0 2023-06-19 13:41:35,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=355230.0, ans=0.0 2023-06-19 13:42:15,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=355290.0, ans=0.0 2023-06-19 13:42:26,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=355350.0, ans=0.1 2023-06-19 13:42:46,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=355350.0, ans=0.125 2023-06-19 13:42:51,251 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-19 13:42:57,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355410.0, ans=0.1 2023-06-19 13:43:04,244 INFO [train.py:996] (0/4) Epoch 2, batch 28750, loss[loss=0.3399, simple_loss=0.3687, pruned_loss=0.1555, over 21744.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3375, pruned_loss=0.1104, over 4281224.22 frames. ], batch size: 507, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:43:58,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 2.722e+02 3.095e+02 3.469e+02 6.636e+02, threshold=6.190e+02, percent-clipped=1.0 2023-06-19 13:44:04,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=355590.0, ans=0.125 2023-06-19 13:44:24,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=355590.0, ans=0.125 2023-06-19 13:45:21,932 INFO [train.py:996] (0/4) Epoch 2, batch 28800, loss[loss=0.3692, simple_loss=0.4073, pruned_loss=0.1655, over 21785.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3426, pruned_loss=0.1115, over 4285326.41 frames. ], batch size: 441, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:46:08,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=355830.0, ans=0.125 2023-06-19 13:46:08,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=355830.0, ans=0.0 2023-06-19 13:47:33,449 INFO [train.py:996] (0/4) Epoch 2, batch 28850, loss[loss=0.2681, simple_loss=0.3182, pruned_loss=0.109, over 21122.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3446, pruned_loss=0.114, over 4294103.13 frames. ], batch size: 607, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:47:33,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=356070.0, ans=0.05 2023-06-19 13:48:13,289 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 2.928e+02 3.488e+02 4.240e+02 7.326e+02, threshold=6.975e+02, percent-clipped=5.0 2023-06-19 13:48:13,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=356130.0, ans=0.2 2023-06-19 13:48:24,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=356190.0, ans=0.125 2023-06-19 13:48:41,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.28 vs. 
limit=15.0 2023-06-19 13:49:55,046 INFO [train.py:996] (0/4) Epoch 2, batch 28900, loss[loss=0.3248, simple_loss=0.3838, pruned_loss=0.1329, over 21664.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3479, pruned_loss=0.1159, over 4297295.26 frames. ], batch size: 414, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:51:16,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=356490.0, ans=0.125 2023-06-19 13:52:11,440 INFO [train.py:996] (0/4) Epoch 2, batch 28950, loss[loss=0.2909, simple_loss=0.3829, pruned_loss=0.09948, over 20717.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3486, pruned_loss=0.1143, over 4282793.53 frames. ], batch size: 607, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:52:42,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=356730.0, ans=0.1 2023-06-19 13:53:05,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=356730.0, ans=0.1 2023-06-19 13:53:08,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=356730.0, ans=0.035 2023-06-19 13:53:15,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.845e+02 3.297e+02 3.891e+02 9.356e+02, threshold=6.594e+02, percent-clipped=2.0 2023-06-19 13:53:23,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=356790.0, ans=0.0 2023-06-19 13:53:47,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=356850.0, ans=0.0 2023-06-19 13:53:49,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=356850.0, ans=0.1 2023-06-19 13:53:59,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=356850.0, ans=0.125 2023-06-19 13:54:19,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=356910.0, ans=0.0 2023-06-19 13:54:22,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=356910.0, ans=0.1 2023-06-19 13:54:36,139 INFO [train.py:996] (0/4) Epoch 2, batch 29000, loss[loss=0.2953, simple_loss=0.3655, pruned_loss=0.1126, over 21777.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3524, pruned_loss=0.1125, over 4280017.12 frames. ], batch size: 332, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:54:51,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=356970.0, ans=0.0 2023-06-19 13:56:45,621 INFO [train.py:996] (0/4) Epoch 2, batch 29050, loss[loss=0.278, simple_loss=0.3348, pruned_loss=0.1106, over 21341.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3512, pruned_loss=0.114, over 4284731.86 frames. 
], batch size: 143, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 13:56:54,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=357270.0, ans=0.0 2023-06-19 13:57:22,899 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.800e+02 3.368e+02 3.834e+02 6.942e+02, threshold=6.736e+02, percent-clipped=1.0 2023-06-19 13:58:23,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=357450.0, ans=0.125 2023-06-19 13:58:56,888 INFO [train.py:996] (0/4) Epoch 2, batch 29100, loss[loss=0.2274, simple_loss=0.2845, pruned_loss=0.08512, over 21749.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3408, pruned_loss=0.1107, over 4275743.57 frames. ], batch size: 351, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:00:22,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.41 vs. limit=5.0 2023-06-19 14:00:24,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=357750.0, ans=0.04949747468305833 2023-06-19 14:00:26,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=357810.0, ans=0.125 2023-06-19 14:00:47,412 INFO [train.py:996] (0/4) Epoch 2, batch 29150, loss[loss=0.3245, simple_loss=0.3737, pruned_loss=0.1377, over 21386.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3372, pruned_loss=0.1075, over 4267850.52 frames. ], batch size: 471, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:01:25,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-19 14:01:29,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=357930.0, ans=0.125 2023-06-19 14:01:30,659 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.793e+02 3.362e+02 4.160e+02 6.908e+02, threshold=6.724e+02, percent-clipped=1.0 2023-06-19 14:01:34,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=357990.0, ans=0.125 2023-06-19 14:02:22,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=358110.0, ans=0.125 2023-06-19 14:02:38,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=358110.0, ans=0.05 2023-06-19 14:02:53,700 INFO [train.py:996] (0/4) Epoch 2, batch 29200, loss[loss=0.2232, simple_loss=0.2852, pruned_loss=0.08058, over 21398.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3328, pruned_loss=0.1064, over 4261539.56 frames. 
], batch size: 131, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:03:08,726 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:03:29,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=358290.0, ans=0.125 2023-06-19 14:04:01,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=358350.0, ans=0.125 2023-06-19 14:04:11,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=358350.0, ans=0.125 2023-06-19 14:04:58,584 INFO [train.py:996] (0/4) Epoch 2, batch 29250, loss[loss=0.2472, simple_loss=0.3338, pruned_loss=0.08028, over 21752.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3315, pruned_loss=0.1033, over 4266477.49 frames. ], batch size: 282, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:05:00,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358470.0, ans=0.1 2023-06-19 14:05:23,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.478e+02 3.001e+02 3.725e+02 6.344e+02, threshold=6.002e+02, percent-clipped=0.0 2023-06-19 14:06:57,919 INFO [train.py:996] (0/4) Epoch 2, batch 29300, loss[loss=0.2956, simple_loss=0.3435, pruned_loss=0.1238, over 21542.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3327, pruned_loss=0.102, over 4266579.74 frames. ], batch size: 441, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:06:58,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358770.0, ans=0.1 2023-06-19 14:09:07,177 INFO [train.py:996] (0/4) Epoch 2, batch 29350, loss[loss=0.2404, simple_loss=0.2871, pruned_loss=0.09686, over 21147.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3271, pruned_loss=0.1017, over 4265087.02 frames. ], batch size: 159, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:09:11,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-19 14:09:21,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=359070.0, ans=0.125 2023-06-19 14:10:02,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.578e+02 2.975e+02 3.361e+02 4.111e+02, threshold=5.949e+02, percent-clipped=0.0 2023-06-19 14:10:27,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=359250.0, ans=0.1 2023-06-19 14:10:29,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=359250.0, ans=0.0 2023-06-19 14:10:43,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.46 vs. limit=12.0 2023-06-19 14:10:43,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=359250.0, ans=0.125 2023-06-19 14:11:03,630 INFO [train.py:996] (0/4) Epoch 2, batch 29400, loss[loss=0.3024, simple_loss=0.3676, pruned_loss=0.1186, over 21468.00 frames. 
], tot_loss[loss=0.2632, simple_loss=0.3275, pruned_loss=0.09951, over 4262656.46 frames. ], batch size: 509, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:11:09,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=359370.0, ans=0.2 2023-06-19 14:11:11,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=359370.0, ans=0.0 2023-06-19 14:11:28,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.74 vs. limit=6.0 2023-06-19 14:12:25,087 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-19 14:12:45,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=359550.0, ans=0.125 2023-06-19 14:12:59,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=359610.0, ans=0.125 2023-06-19 14:13:09,872 INFO [train.py:996] (0/4) Epoch 2, batch 29450, loss[loss=0.3039, simple_loss=0.361, pruned_loss=0.1234, over 21757.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3247, pruned_loss=0.09806, over 4270149.17 frames. ], batch size: 332, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:13:49,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=359730.0, ans=0.0 2023-06-19 14:13:52,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=359730.0, ans=0.125 2023-06-19 14:14:07,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=359730.0, ans=10.0 2023-06-19 14:14:07,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-19 14:14:13,958 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.887e+02 3.363e+02 4.208e+02 7.799e+02, threshold=6.726e+02, percent-clipped=7.0 2023-06-19 14:14:16,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=359730.0, ans=0.1 2023-06-19 14:14:23,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=359790.0, ans=0.1 2023-06-19 14:14:52,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=359910.0, ans=0.0 2023-06-19 14:15:23,686 INFO [train.py:996] (0/4) Epoch 2, batch 29500, loss[loss=0.2492, simple_loss=0.301, pruned_loss=0.09869, over 21176.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3314, pruned_loss=0.1029, over 4269024.38 frames. ], batch size: 608, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 14:15:31,937 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-60000.pt 2023-06-19 14:16:30,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=360090.0, ans=0.0 2023-06-19 14:17:33,212 INFO [train.py:996] (0/4) Epoch 2, batch 29550, loss[loss=0.2694, simple_loss=0.3248, pruned_loss=0.107, over 21887.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3301, pruned_loss=0.1043, over 4273648.62 frames. ], batch size: 316, lr: 1.45e-02, grad_scale: 32.0
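The checkpoint saved above at epoch 2, batch 29500 is named checkpoint-60000.pt: batch-level checkpoints are keyed on the cumulative batch counter across epochs, while epoch boundaries get epoch-N.pt files (epoch-2.pt appears further down). A sketch of that cadence; the interval argument and the contents of state are assumptions:

from pathlib import Path
import torch

def maybe_save_batch_checkpoint(exp_dir: Path, batch_idx_train: int,
                                save_every_n: int, state: dict) -> None:
    # Named by the cumulative batch counter, e.g. checkpoint-60000.pt.
    if batch_idx_train > 0 and batch_idx_train % save_every_n == 0:
        torch.save(state, exp_dir / f"checkpoint-{batch_idx_train}.pt")

def save_epoch_checkpoint(exp_dir: Path, epoch: int, state: dict) -> None:
    # Named by the finished epoch, e.g. epoch-2.pt.
    torch.save(state, exp_dir / f"epoch-{epoch}.pt")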
2023-06-19 14:18:20,043 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.713e+02 3.208e+02 3.753e+02 5.993e+02, threshold=6.415e+02, percent-clipped=0.0 2023-06-19 14:19:37,169 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:19:52,643 INFO [train.py:996] (0/4) Epoch 2, batch 29600, loss[loss=0.2596, simple_loss=0.2983, pruned_loss=0.1105, over 20310.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3371, pruned_loss=0.1079, over 4280128.75 frames. ], batch size: 703, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:20:27,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=360630.0, ans=0.0 2023-06-19 14:20:44,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.30 vs. limit=6.0 2023-06-19 14:20:47,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=360690.0, ans=0.07 2023-06-19 14:20:59,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=360690.0, ans=0.1 2023-06-19 14:21:10,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=360750.0, ans=0.125 2023-06-19 14:21:41,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=360810.0, ans=0.125 2023-06-19 14:22:07,138 INFO [train.py:996] (0/4) Epoch 2, batch 29650, loss[loss=0.2162, simple_loss=0.2818, pruned_loss=0.07532, over 21833.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.333, pruned_loss=0.1034, over 4282266.20 frames. ], batch size: 282, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:22:17,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=360870.0, ans=0.125 2023-06-19 14:22:28,414 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-19 14:22:43,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.621e+02 3.260e+02 3.981e+02 6.748e+02, threshold=6.520e+02, percent-clipped=1.0 2023-06-19 14:23:18,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=361050.0, ans=0.125 2023-06-19 14:23:18,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=361050.0, ans=0.125 2023-06-19 14:23:57,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=361110.0, ans=0.125 2023-06-19 14:24:19,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=361170.0, ans=0.1 2023-06-19 14:24:20,064 INFO [train.py:996] (0/4) Epoch 2, batch 29700, loss[loss=0.3045, simple_loss=0.4015, pruned_loss=0.1037, over 21759.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3357, pruned_loss=0.1044, over 4283396.50 frames.
], batch size: 298, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:24:20,544 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:24:20,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=361170.0, ans=0.1 2023-06-19 14:24:27,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-19 14:24:28,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. limit=10.0 2023-06-19 14:24:41,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-19 14:24:48,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=361230.0, ans=0.0 2023-06-19 14:25:06,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=361290.0, ans=0.1 2023-06-19 14:25:09,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.68 vs. limit=15.0 2023-06-19 14:26:01,216 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:26:05,131 INFO [train.py:996] (0/4) Epoch 2, batch 29750, loss[loss=0.2789, simple_loss=0.3648, pruned_loss=0.09653, over 21834.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3412, pruned_loss=0.1037, over 4288274.25 frames. ], batch size: 371, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:26:24,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=361530.0, ans=0.0 2023-06-19 14:26:37,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.61 vs. limit=12.0 2023-06-19 14:26:40,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=361530.0, ans=15.0 2023-06-19 14:26:41,004 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.594e+02 3.244e+02 4.467e+02 8.629e+02, threshold=6.487e+02, percent-clipped=7.0 2023-06-19 14:26:43,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=361530.0, ans=0.0 2023-06-19 14:27:03,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=22.5 2023-06-19 14:27:14,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-19 14:27:21,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361650.0, ans=0.1 2023-06-19 14:28:05,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=361710.0, ans=0.2 2023-06-19 14:28:12,796 INFO [train.py:996] (0/4) Epoch 2, batch 29800, loss[loss=0.289, simple_loss=0.3392, pruned_loss=0.1194, over 21893.00 frames. 
], tot_loss[loss=0.2755, simple_loss=0.3424, pruned_loss=0.1043, over 4286306.59 frames. ], batch size: 414, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 14:28:32,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=361830.0, ans=0.5 2023-06-19 14:28:45,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=361830.0, ans=0.125 2023-06-19 14:30:01,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=362010.0, ans=0.125 2023-06-19 14:30:10,965 INFO [train.py:996] (0/4) Epoch 2, batch 29850, loss[loss=0.2931, simple_loss=0.3519, pruned_loss=0.1172, over 21919.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3382, pruned_loss=0.1022, over 4282323.59 frames. ], batch size: 107, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 14:30:12,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=362070.0, ans=0.125 2023-06-19 14:30:35,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=362070.0, ans=0.0 2023-06-19 14:30:36,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=362130.0, ans=0.125 2023-06-19 14:30:47,952 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.554e+02 3.049e+02 3.847e+02 7.515e+02, threshold=6.099e+02, percent-clipped=1.0 2023-06-19 14:32:22,164 INFO [train.py:996] (0/4) Epoch 2, batch 29900, loss[loss=0.3136, simple_loss=0.3699, pruned_loss=0.1286, over 21302.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3381, pruned_loss=0.1042, over 4282314.89 frames. ], batch size: 159, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 14:34:31,661 INFO [train.py:996] (0/4) Epoch 2, batch 29950, loss[loss=0.3012, simple_loss=0.3515, pruned_loss=0.1255, over 20628.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3423, pruned_loss=0.1083, over 4277940.06 frames. ], batch size: 607, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:34:38,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=362670.0, ans=0.0 2023-06-19 14:34:55,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=362730.0, ans=0.125 2023-06-19 14:35:09,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=362730.0, ans=0.125 2023-06-19 14:35:14,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.991e+02 3.617e+02 4.391e+02 6.676e+02, threshold=7.234e+02, percent-clipped=4.0 2023-06-19 14:35:57,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=362850.0, ans=0.125 2023-06-19 14:36:20,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=362910.0, ans=0.125 2023-06-19 14:36:39,172 INFO [train.py:996] (0/4) Epoch 2, batch 30000, loss[loss=0.2756, simple_loss=0.3452, pruned_loss=0.103, over 21431.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3449, pruned_loss=0.1088, over 4281642.68 frames. 
], batch size: 131, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:36:39,173 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 14:37:25,479 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2591, simple_loss=0.3611, pruned_loss=0.07848, over 1796401.00 frames. 2023-06-19 14:37:25,480 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-19 14:37:34,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=362970.0, ans=0.2 2023-06-19 14:38:18,107 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-19 14:38:22,584 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0 2023-06-19 14:39:20,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=363210.0, ans=0.2 2023-06-19 14:39:26,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=363210.0, ans=0.2 2023-06-19 14:39:47,271 INFO [train.py:996] (0/4) Epoch 2, batch 30050, loss[loss=0.3189, simple_loss=0.4086, pruned_loss=0.1146, over 21805.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3486, pruned_loss=0.1054, over 4284941.97 frames. ], batch size: 371, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:39:47,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=363270.0, ans=0.125 2023-06-19 14:40:26,213 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.570e+02 3.053e+02 3.907e+02 7.518e+02, threshold=6.106e+02, percent-clipped=1.0 2023-06-19 14:40:56,236 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:41:17,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=363510.0, ans=0.0 2023-06-19 14:41:23,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=363510.0, ans=0.0 2023-06-19 14:41:25,605 INFO [train.py:996] (0/4) Epoch 2, batch 30100, loss[loss=0.3169, simple_loss=0.3359, pruned_loss=0.149, over 21306.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3458, pruned_loss=0.1048, over 4281396.17 frames. ], batch size: 507, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:42:03,087 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:42:07,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=363630.0, ans=0.2 2023-06-19 14:42:22,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. limit=6.0
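Every validation record in this log covers exactly 1796401.00 frames, which suggests the full dev set is scored each time: training pauses, a frame-weighted average loss is computed over the dev loader, and peak CUDA memory is reported afterwards. A sketch of that loop; model.score and the interval check are illustrative stand-ins for the train.py internals:

import torch

@torch.no_grad()
def compute_validation_loss(model, dev_loader):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in dev_loader:
        loss, num_frames = model.score(batch)  # placeholder scoring call
        tot_loss += loss * num_frames          # frame-weighted aggregation
        tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames, tot_frames

def maybe_validate(model, dev_loader, batch_idx_train: int, valid_interval: int):
    if batch_idx_train % valid_interval == 0:
        loss, frames = compute_validation_loss(model, dev_loader)
        peak_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
        print(f"validation: loss={loss:.4f}, over {frames:.2f} frames.")
        print(f"Maximum memory allocated so far is {peak_mb}MB")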
2023-06-19 14:42:46,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=363750.0, ans=0.0 2023-06-19 14:43:06,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=363810.0, ans=10.0 2023-06-19 14:43:27,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=363810.0, ans=0.07 2023-06-19 14:43:37,592 INFO [train.py:996] (0/4) Epoch 2, batch 30150, loss[loss=0.3102, simple_loss=0.3636, pruned_loss=0.1284, over 21563.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3403, pruned_loss=0.1059, over 4279117.07 frames. ], batch size: 389, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:44:32,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.807e+02 3.248e+02 3.802e+02 5.683e+02, threshold=6.495e+02, percent-clipped=0.0 2023-06-19 14:44:42,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=363990.0, ans=0.1 2023-06-19 14:44:43,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=363990.0, ans=0.125 2023-06-19 14:45:53,862 INFO [train.py:996] (0/4) Epoch 2, batch 30200, loss[loss=0.3207, simple_loss=0.3912, pruned_loss=0.1251, over 21444.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3439, pruned_loss=0.1052, over 4282002.51 frames. ], batch size: 471, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:48:15,229 INFO [train.py:996] (0/4) Epoch 2, batch 30250, loss[loss=0.2858, simple_loss=0.3641, pruned_loss=0.1038, over 21334.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3532, pruned_loss=0.1092, over 4282043.33 frames. ], batch size: 159, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:48:15,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=364470.0, ans=0.5 2023-06-19 14:48:18,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=364470.0, ans=0.09899494936611666 2023-06-19 14:48:22,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364470.0, ans=0.1 2023-06-19 14:48:53,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.937e+02 3.483e+02 4.410e+02 7.312e+02, threshold=6.966e+02, percent-clipped=2.0 2023-06-19 14:50:02,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=364710.0, ans=15.0 2023-06-19 14:50:12,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=364770.0, ans=0.125 2023-06-19 14:50:12,981 INFO [train.py:996] (0/4) Epoch 2, batch 30300, loss[loss=0.2344, simple_loss=0.2863, pruned_loss=0.09122, over 21228.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3488, pruned_loss=0.1079, over 4284028.45 frames. ], batch size: 176, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:50:47,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=364830.0, ans=0.125 2023-06-19 14:52:44,514 INFO [train.py:996] (0/4) Epoch 2, batch 30350, loss[loss=0.2659, simple_loss=0.3317, pruned_loss=0.1, over 21726.00 frames.
], tot_loss[loss=0.2819, simple_loss=0.3472, pruned_loss=0.1083, over 4286909.13 frames. ], batch size: 298, lr: 1.44e-02, grad_scale: 16.0 2023-06-19 14:53:17,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=365130.0, ans=0.0 2023-06-19 14:53:24,914 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:53:34,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.858e+02 3.357e+02 4.811e+02 8.525e+02, threshold=6.714e+02, percent-clipped=9.0 2023-06-19 14:53:41,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-19 14:54:48,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=365310.0, ans=0.04949747468305833 2023-06-19 14:55:26,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=365310.0, ans=0.2 2023-06-19 14:55:31,010 INFO [train.py:996] (0/4) Epoch 2, batch 30400, loss[loss=0.2735, simple_loss=0.3021, pruned_loss=0.1225, over 20315.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3375, pruned_loss=0.105, over 4274673.63 frames. ], batch size: 703, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 14:58:17,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=365550.0, ans=0.125 2023-06-19 14:59:30,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=365610.0, ans=0.2 2023-06-19 14:59:51,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-19 15:00:10,527 INFO [train.py:996] (0/4) Epoch 2, batch 30450, loss[loss=0.3589, simple_loss=0.4584, pruned_loss=0.1297, over 19934.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.341, pruned_loss=0.107, over 4212203.23 frames. ], batch size: 702, lr: 1.43e-02, grad_scale: 32.0 2023-06-19 15:00:12,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=365670.0, ans=0.1 2023-06-19 15:00:25,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=365670.0, ans=0.0 2023-06-19 15:01:05,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=365730.0, ans=15.0 2023-06-19 15:01:46,279 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.753e+02 4.971e+02 7.783e+02 2.032e+03, threshold=9.942e+02, percent-clipped=30.0 2023-06-19 15:02:38,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=365850.0, ans=0.125 2023-06-19 15:03:02,434 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-2.pt 2023-06-19 15:05:22,953 INFO [train.py:996] (0/4) Epoch 3, batch 0, loss[loss=0.2964, simple_loss=0.3517, pruned_loss=0.1205, over 21772.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3517, pruned_loss=0.1205, over 21772.00 frames. ], batch size: 102, lr: 1.22e-02, grad_scale: 32.0
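Across epoch 2 the learning rate drifts from 1.47e-02 down to 1.43e-02, then steps to 1.22e-02 at the first batch of epoch 3: the signature of a schedule with separate batch-count and epoch-count decay factors, as in icefall's Eden scheduler. A sketch of that functional form; treat both the formula's applicability to this run and the default constants as assumptions:

def eden_lr(base_lr: float, batch: int, epoch: int,
            lr_batches: float = 5000.0, lr_epochs: float = 4.0) -> float:
    # Smooth decay in the cumulative batch count, plus a slower factor that
    # only moves when the epoch counter does - hence the visible step at the
    # epoch boundary and the slow drift in between.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor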
2023-06-19 15:05:22,954 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 15:06:09,292 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2643, simple_loss=0.3711, pruned_loss=0.07872, over 1796401.00 frames. 2023-06-19 15:06:09,294 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-19 15:06:13,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=365934.0, ans=0.0 2023-06-19 15:06:31,238 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:06:45,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366054.0, ans=0.1 2023-06-19 15:07:04,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=366114.0, ans=0.125 2023-06-19 15:07:13,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=366114.0, ans=0.0 2023-06-19 15:07:39,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=366234.0, ans=0.2 2023-06-19 15:07:40,204 INFO [train.py:996] (0/4) Epoch 3, batch 50, loss[loss=0.2988, simple_loss=0.383, pruned_loss=0.1073, over 20691.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3514, pruned_loss=0.1059, over 961366.82 frames. ], batch size: 607, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:07:56,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-19 15:08:22,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=366294.0, ans=0.125 2023-06-19 15:08:45,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.896e+02 3.528e+02 5.821e+02 1.512e+03, threshold=7.056e+02, percent-clipped=7.0 2023-06-19 15:09:07,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=366414.0, ans=0.125 2023-06-19 15:09:08,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=366414.0, ans=0.035 2023-06-19 15:09:14,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.90 vs. limit=22.5 2023-06-19 15:09:46,274 INFO [train.py:996] (0/4) Epoch 3, batch 100, loss[loss=0.3535, simple_loss=0.4085, pruned_loss=0.1493, over 21481.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3592, pruned_loss=0.1069, over 1694312.29 frames. ], batch size: 471, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:10:22,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=366594.0, ans=0.2 2023-06-19 15:10:59,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.21 vs.
limit=15.0 2023-06-19 15:11:11,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=366774.0, ans=0.07 2023-06-19 15:11:19,716 INFO [train.py:996] (0/4) Epoch 3, batch 150, loss[loss=0.2631, simple_loss=0.3466, pruned_loss=0.08982, over 21787.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3558, pruned_loss=0.1051, over 2269200.96 frames. ], batch size: 371, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:12:12,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.572e+02 2.987e+02 3.801e+02 6.423e+02, threshold=5.974e+02, percent-clipped=0.0 2023-06-19 15:12:51,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=367074.0, ans=0.0 2023-06-19 15:13:16,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=15.0 2023-06-19 15:13:25,862 INFO [train.py:996] (0/4) Epoch 3, batch 200, loss[loss=0.2756, simple_loss=0.3662, pruned_loss=0.09246, over 21770.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3544, pruned_loss=0.1037, over 2720744.33 frames. ], batch size: 332, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:14:14,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=367254.0, ans=0.0 2023-06-19 15:14:36,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=367314.0, ans=0.125 2023-06-19 15:14:39,492 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=12.0 2023-06-19 15:15:08,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=367374.0, ans=0.0 2023-06-19 15:15:20,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=367434.0, ans=0.0 2023-06-19 15:15:25,185 INFO [train.py:996] (0/4) Epoch 3, batch 250, loss[loss=0.2839, simple_loss=0.3613, pruned_loss=0.1032, over 21635.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3514, pruned_loss=0.1034, over 3068506.57 frames. ], batch size: 414, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 15:15:28,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=367434.0, ans=0.125 2023-06-19 15:15:55,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=367494.0, ans=0.125 2023-06-19 15:16:20,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.820e+02 3.135e+02 3.949e+02 6.710e+02, threshold=6.270e+02, percent-clipped=4.0 2023-06-19 15:16:42,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-06-19 15:17:07,723 INFO [train.py:996] (0/4) Epoch 3, batch 300, loss[loss=0.2268, simple_loss=0.2842, pruned_loss=0.08468, over 21321.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3466, pruned_loss=0.1031, over 3336308.73 frames. 
], batch size: 131, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:17:39,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=367734.0, ans=0.0 2023-06-19 15:17:41,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=15.0 2023-06-19 15:18:03,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-19 15:18:04,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-19 15:18:06,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=367854.0, ans=0.2 2023-06-19 15:18:47,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=367914.0, ans=0.0 2023-06-19 15:19:27,145 INFO [train.py:996] (0/4) Epoch 3, batch 350, loss[loss=0.2594, simple_loss=0.313, pruned_loss=0.1029, over 21871.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3408, pruned_loss=0.1017, over 3538763.66 frames. ], batch size: 373, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:19:51,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=368094.0, ans=0.125 2023-06-19 15:19:52,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=368094.0, ans=0.125 2023-06-19 15:20:08,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=368154.0, ans=0.125 2023-06-19 15:20:27,640 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.616e+02 3.033e+02 3.600e+02 6.018e+02, threshold=6.066e+02, percent-clipped=0.0 2023-06-19 15:21:01,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=368214.0, ans=0.0 2023-06-19 15:21:06,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=368274.0, ans=0.0 2023-06-19 15:21:09,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=368274.0, ans=0.2 2023-06-19 15:21:16,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=368274.0, ans=0.125 2023-06-19 15:21:20,089 INFO [train.py:996] (0/4) Epoch 3, batch 400, loss[loss=0.2461, simple_loss=0.311, pruned_loss=0.09056, over 21891.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3322, pruned_loss=0.09901, over 3699169.66 frames. 
], batch size: 107, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:21:44,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=368334.0, ans=0.125 2023-06-19 15:22:20,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=368394.0, ans=0.1 2023-06-19 15:22:46,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368454.0, ans=0.1 2023-06-19 15:23:06,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=368514.0, ans=0.125 2023-06-19 15:23:08,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=368514.0, ans=0.125 2023-06-19 15:23:44,686 INFO [train.py:996] (0/4) Epoch 3, batch 450, loss[loss=0.2363, simple_loss=0.3219, pruned_loss=0.07532, over 21575.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3296, pruned_loss=0.09775, over 3830260.29 frames. ], batch size: 389, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:23:54,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=368634.0, ans=0.125 2023-06-19 15:24:04,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=368634.0, ans=0.95 2023-06-19 15:24:53,235 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.675e+02 2.675e+02 3.174e+02 4.141e+02 7.803e+02, threshold=6.347e+02, percent-clipped=3.0 2023-06-19 15:25:09,344 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-19 15:25:31,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=368814.0, ans=0.125 2023-06-19 15:25:59,100 INFO [train.py:996] (0/4) Epoch 3, batch 500, loss[loss=0.2555, simple_loss=0.3341, pruned_loss=0.08839, over 21243.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3319, pruned_loss=0.0959, over 3932750.89 frames. ], batch size: 159, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:26:52,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=368994.0, ans=0.0 2023-06-19 15:27:14,245 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:27:17,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=369054.0, ans=0.2 2023-06-19 15:27:22,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=369114.0, ans=0.125 2023-06-19 15:27:34,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=369174.0, ans=0.0 2023-06-19 15:28:12,601 INFO [train.py:996] (0/4) Epoch 3, batch 550, loss[loss=0.3548, simple_loss=0.4401, pruned_loss=0.1348, over 21632.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3387, pruned_loss=0.0969, over 4012320.07 frames. 
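
The [optim.py:471] lines summarize recent gradient norms as five quantiles (min, 25%, median, 75%, max). With Clipping_scale=2.0 the clipping threshold is twice the running median (the batch-150 report above has median 2.987e+02, hence threshold=5.974e+02), and percent-clipped is the share of recent batches whose norm exceeded it. A sketch of that arithmetic, with illustrative names:

```python
import torch

def clipping_report(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # Five quantiles of recent gradient norms: min, 25%, median, 75%, max.
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]  # 2.0 x the median grad norm
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return q, threshold, percent_clipped

# Batch-150 report above: median 2.987e+02 -> threshold 5.974e+02,
# with percent-clipped=0.0 since no recent norm exceeded it.
```
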
], batch size: 441, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:29:06,952 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.826e+02 3.270e+02 4.070e+02 7.651e+02, threshold=6.541e+02, percent-clipped=1.0 2023-06-19 15:29:34,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=369414.0, ans=0.125 2023-06-19 15:29:34,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=369414.0, ans=0.0 2023-06-19 15:29:51,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=369474.0, ans=0.125 2023-06-19 15:30:04,908 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2023-06-19 15:30:15,624 INFO [train.py:996] (0/4) Epoch 3, batch 600, loss[loss=0.283, simple_loss=0.3865, pruned_loss=0.08981, over 21784.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3414, pruned_loss=0.09789, over 4080855.23 frames. ], batch size: 332, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:30:21,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369534.0, ans=0.1 2023-06-19 15:30:37,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=369534.0, ans=0.125 2023-06-19 15:31:23,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=369654.0, ans=0.125 2023-06-19 15:31:25,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-19 15:31:28,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=369654.0, ans=0.1 2023-06-19 15:31:42,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=369714.0, ans=0.0 2023-06-19 15:32:18,106 INFO [train.py:996] (0/4) Epoch 3, batch 650, loss[loss=0.294, simple_loss=0.4104, pruned_loss=0.08881, over 19863.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3402, pruned_loss=0.0978, over 4112130.58 frames. ], batch size: 702, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:32:25,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369834.0, ans=0.1 2023-06-19 15:32:37,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=369834.0, ans=0.125 2023-06-19 15:33:22,687 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.840e+02 3.848e+02 4.493e+02 8.755e+02, threshold=7.695e+02, percent-clipped=3.0 2023-06-19 15:33:23,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-19 15:33:44,880 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.99 vs. limit=10.0 2023-06-19 15:34:26,364 INFO [train.py:996] (0/4) Epoch 3, batch 700, loss[loss=0.2353, simple_loss=0.305, pruned_loss=0.08281, over 21867.00 frames. 
], tot_loss[loss=0.2701, simple_loss=0.3419, pruned_loss=0.09917, over 4150607.39 frames. ], batch size: 98, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:34:38,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.53 vs. limit=15.0 2023-06-19 15:34:57,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=370194.0, ans=0.125 2023-06-19 15:36:29,391 INFO [train.py:996] (0/4) Epoch 3, batch 750, loss[loss=0.2624, simple_loss=0.3256, pruned_loss=0.09957, over 21473.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3409, pruned_loss=0.09999, over 4185067.83 frames. ], batch size: 211, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:37:10,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=370494.0, ans=0.125 2023-06-19 15:37:13,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=370494.0, ans=0.0 2023-06-19 15:37:30,721 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:37:31,641 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.860e+02 3.172e+02 4.099e+02 8.438e+02, threshold=6.343e+02, percent-clipped=1.0 2023-06-19 15:37:37,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=370614.0, ans=0.0 2023-06-19 15:37:45,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=370614.0, ans=0.2 2023-06-19 15:38:37,261 INFO [train.py:996] (0/4) Epoch 3, batch 800, loss[loss=0.2834, simple_loss=0.3314, pruned_loss=0.1177, over 21260.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3373, pruned_loss=0.1002, over 4200069.16 frames. ], batch size: 159, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:39:03,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=370794.0, ans=15.0 2023-06-19 15:39:43,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=370854.0, ans=0.125 2023-06-19 15:40:14,253 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-19 15:40:36,937 INFO [train.py:996] (0/4) Epoch 3, batch 850, loss[loss=0.2537, simple_loss=0.3148, pruned_loss=0.0963, over 21538.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3339, pruned_loss=0.1003, over 4228226.35 frames. ], batch size: 212, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:41:01,015 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:41:20,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.08 vs. 
limit=8.0 2023-06-19 15:41:35,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.741e+02 3.074e+02 3.686e+02 5.946e+02, threshold=6.148e+02, percent-clipped=0.0 2023-06-19 15:41:35,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=371154.0, ans=0.0 2023-06-19 15:41:37,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=371154.0, ans=0.125 2023-06-19 15:41:45,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=371214.0, ans=0.1 2023-06-19 15:41:45,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=371214.0, ans=0.125 2023-06-19 15:41:48,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=371214.0, ans=0.0 2023-06-19 15:42:29,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=371274.0, ans=0.1 2023-06-19 15:42:35,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=371274.0, ans=0.125 2023-06-19 15:42:37,788 INFO [train.py:996] (0/4) Epoch 3, batch 900, loss[loss=0.2521, simple_loss=0.3125, pruned_loss=0.09586, over 21891.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3295, pruned_loss=0.09935, over 4247966.27 frames. ], batch size: 351, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:42:55,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=371334.0, ans=0.125 2023-06-19 15:43:39,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=371454.0, ans=0.0 2023-06-19 15:44:40,528 INFO [train.py:996] (0/4) Epoch 3, batch 950, loss[loss=0.2637, simple_loss=0.3103, pruned_loss=0.1086, over 21525.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3277, pruned_loss=0.09883, over 4252035.62 frames. ], batch size: 548, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:45:25,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=371694.0, ans=0.2 2023-06-19 15:45:32,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=371754.0, ans=0.125 2023-06-19 15:45:36,729 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.723e+02 3.079e+02 3.859e+02 5.682e+02, threshold=6.158e+02, percent-clipped=0.0 2023-06-19 15:46:46,790 INFO [train.py:996] (0/4) Epoch 3, batch 1000, loss[loss=0.2503, simple_loss=0.3182, pruned_loss=0.09119, over 21800.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3278, pruned_loss=0.09846, over 4261445.96 frames. 
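
The [scaling.py:182] entries trace ScheduledFloat hyperparameters: dropout probabilities, skip rates and balancer probabilities that are piecewise-linear functions of batch_count rather than constants, which is why each entry records the batch_count at which the value (ans=...) was read. A minimal sketch of such a schedule; the breakpoints below are invented for illustration:

```python
# Simplified piecewise-linear schedule in the spirit of ScheduledFloat.
class ScheduledFloat:
    def __init__(self, *points):  # (batch_count, value) pairs, sorted by x
        self.points = points

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:  # linear interpolation inside [x0, x1]
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return pts[-1][1]

# Invented breakpoints: strong dropout early, relaxed later in training.
dropout = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
print(dropout.value(371214.0))  # 0.1, as in the encoder_embed.dropout.p
                                # entry above (ans=0.1)
```
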
], batch size: 298, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:46:48,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=371934.0, ans=0.125 2023-06-19 15:47:02,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=371934.0, ans=0.125 2023-06-19 15:47:45,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372054.0, ans=0.1 2023-06-19 15:47:55,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=372054.0, ans=0.125 2023-06-19 15:47:57,782 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-19 15:48:10,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=372114.0, ans=0.125 2023-06-19 15:48:41,733 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-19 15:49:03,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=372174.0, ans=0.04949747468305833 2023-06-19 15:49:10,934 INFO [train.py:996] (0/4) Epoch 3, batch 1050, loss[loss=0.2518, simple_loss=0.3173, pruned_loss=0.09312, over 21453.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3272, pruned_loss=0.09748, over 4275967.34 frames. ], batch size: 194, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:49:15,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-19 15:49:31,584 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.76 vs. limit=22.5 2023-06-19 15:49:39,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=372294.0, ans=0.1 2023-06-19 15:49:53,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.550e+02 3.026e+02 4.001e+02 6.814e+02, threshold=6.053e+02, percent-clipped=1.0 2023-06-19 15:51:06,129 INFO [train.py:996] (0/4) Epoch 3, batch 1100, loss[loss=0.1966, simple_loss=0.2679, pruned_loss=0.06268, over 16467.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3266, pruned_loss=0.09637, over 4275636.99 frames. ], batch size: 61, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:51:42,712 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-19 15:53:23,722 INFO [train.py:996] (0/4) Epoch 3, batch 1150, loss[loss=0.2652, simple_loss=0.3309, pruned_loss=0.09975, over 21743.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3277, pruned_loss=0.09728, over 4284835.20 frames. 
], batch size: 441, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:54:36,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.547e+02 3.199e+02 3.726e+02 8.923e+02, threshold=6.397e+02, percent-clipped=6.0 2023-06-19 15:54:37,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=372954.0, ans=0.0 2023-06-19 15:54:43,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=373014.0, ans=0.0 2023-06-19 15:55:37,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.08 vs. limit=15.0 2023-06-19 15:55:37,477 INFO [train.py:996] (0/4) Epoch 3, batch 1200, loss[loss=0.2873, simple_loss=0.3474, pruned_loss=0.1136, over 21724.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3287, pruned_loss=0.09779, over 4283890.52 frames. ], batch size: 389, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:55:48,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=373134.0, ans=0.0 2023-06-19 15:55:50,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=373134.0, ans=0.2 2023-06-19 15:57:01,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=373314.0, ans=0.125 2023-06-19 15:57:21,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=373374.0, ans=0.125 2023-06-19 15:57:45,879 INFO [train.py:996] (0/4) Epoch 3, batch 1250, loss[loss=0.252, simple_loss=0.3227, pruned_loss=0.09071, over 21787.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3318, pruned_loss=0.09902, over 4290689.33 frames. ], batch size: 247, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 15:57:51,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-19 15:58:56,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.686e+02 3.168e+02 3.917e+02 6.417e+02, threshold=6.337e+02, percent-clipped=1.0 2023-06-19 15:59:09,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=373614.0, ans=0.125 2023-06-19 15:59:18,533 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-19 15:59:44,123 INFO [train.py:996] (0/4) Epoch 3, batch 1300, loss[loss=0.2654, simple_loss=0.3337, pruned_loss=0.09858, over 21498.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3324, pruned_loss=0.09911, over 4297857.39 frames. ], batch size: 194, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 16:00:10,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. 
limit=6.0 2023-06-19 16:00:53,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=373854.0, ans=0.0 2023-06-19 16:01:02,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=373914.0, ans=0.125 2023-06-19 16:01:23,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=373974.0, ans=0.125 2023-06-19 16:01:30,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-19 16:01:37,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=373974.0, ans=0.125 2023-06-19 16:01:52,304 INFO [train.py:996] (0/4) Epoch 3, batch 1350, loss[loss=0.2471, simple_loss=0.3142, pruned_loss=0.09002, over 21498.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3346, pruned_loss=0.1011, over 4297135.29 frames. ], batch size: 131, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:02:29,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.06 vs. limit=15.0 2023-06-19 16:02:33,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=374094.0, ans=0.0 2023-06-19 16:02:52,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.921e+02 3.498e+02 4.356e+02 8.229e+02, threshold=6.996e+02, percent-clipped=3.0 2023-06-19 16:03:48,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=374274.0, ans=0.125 2023-06-19 16:03:51,336 INFO [train.py:996] (0/4) Epoch 3, batch 1400, loss[loss=0.2588, simple_loss=0.317, pruned_loss=0.1004, over 20110.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3321, pruned_loss=0.1003, over 4293866.33 frames. ], batch size: 703, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:04:12,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=374394.0, ans=0.0 2023-06-19 16:05:23,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=374514.0, ans=0.125 2023-06-19 16:05:26,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=374514.0, ans=0.0 2023-06-19 16:05:54,410 INFO [train.py:996] (0/4) Epoch 3, batch 1450, loss[loss=0.2485, simple_loss=0.3237, pruned_loss=0.08672, over 21692.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3345, pruned_loss=0.1015, over 4295011.93 frames. ], batch size: 112, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:06:12,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=374634.0, ans=0.125 2023-06-19 16:06:55,450 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.676e+02 2.954e+02 3.782e+02 5.807e+02, threshold=5.909e+02, percent-clipped=0.0 2023-06-19 16:07:34,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. 
limit=15.0 2023-06-19 16:07:38,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=374874.0, ans=0.2 2023-06-19 16:08:02,897 INFO [train.py:996] (0/4) Epoch 3, batch 1500, loss[loss=0.3094, simple_loss=0.3684, pruned_loss=0.1252, over 21559.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3345, pruned_loss=0.1031, over 4298820.06 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:09:16,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0 2023-06-19 16:10:13,228 INFO [train.py:996] (0/4) Epoch 3, batch 1550, loss[loss=0.2751, simple_loss=0.3316, pruned_loss=0.1092, over 21767.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3326, pruned_loss=0.1023, over 4289278.82 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:11:10,283 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.586e+02 2.974e+02 3.700e+02 5.786e+02, threshold=5.949e+02, percent-clipped=0.0 2023-06-19 16:11:30,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=375414.0, ans=0.125 2023-06-19 16:11:34,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=375414.0, ans=0.2 2023-06-19 16:12:22,183 INFO [train.py:996] (0/4) Epoch 3, batch 1600, loss[loss=0.2363, simple_loss=0.3103, pruned_loss=0.08117, over 20026.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3305, pruned_loss=0.09996, over 4279293.73 frames. ], batch size: 702, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:12:42,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-19 16:13:28,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=375654.0, ans=0.125 2023-06-19 16:14:17,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=375774.0, ans=0.1 2023-06-19 16:14:27,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=375774.0, ans=0.125 2023-06-19 16:14:36,776 INFO [train.py:996] (0/4) Epoch 3, batch 1650, loss[loss=0.2935, simple_loss=0.3651, pruned_loss=0.111, over 21931.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3291, pruned_loss=0.09868, over 4280632.84 frames. ], batch size: 372, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:14:46,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=375834.0, ans=0.2 2023-06-19 16:15:23,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=22.5 2023-06-19 16:15:44,525 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.739e+02 3.147e+02 3.944e+02 6.533e+02, threshold=6.293e+02, percent-clipped=5.0 2023-06-19 16:17:05,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=376074.0, ans=0.0 2023-06-19 16:17:09,618 INFO [train.py:996] (0/4) Epoch 3, batch 1700, loss[loss=0.2894, simple_loss=0.3631, pruned_loss=0.1078, over 21592.00 frames. 
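
The [scaling.py:962] Whitening entries come from modules that discourage strongly correlated channels: each compares a measured whitening metric for some activation against that module's limit, and a corrective gradient is applied only while the metric stays above the limit. The sketch below shows one plausible form of such a metric (the eigenvalue-spread ratio, which is 1.0 for perfectly white activations); it is in the spirit of the module, not its exact code.

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels), assumed zero-mean for simplicity.
    num_channels = x.shape[-1]
    cov = (x.t() @ x) / x.shape[0]                  # (C, C) covariance
    mean_eig = torch.diagonal(cov).mean()           # E[lambda] = trace/C
    mean_sq_eig = (cov * cov).sum() / num_channels  # E[lambda^2] = tr(cov^2)/C
    # 1.0 when all eigenvalues are equal (white); grows with their spread.
    return mean_sq_eig / mean_eig ** 2

x = torch.randn(1000, 256)   # near-white input -> metric close to 1.0
print(whitening_metric(x))   # compare e.g. "metric=12.90 vs. limit=15.0" above
```
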
], tot_loss[loss=0.2667, simple_loss=0.3335, pruned_loss=0.09999, over 4281634.47 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:19:35,970 INFO [train.py:996] (0/4) Epoch 3, batch 1750, loss[loss=0.2871, simple_loss=0.381, pruned_loss=0.09656, over 21631.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3332, pruned_loss=0.09921, over 4275332.75 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:20:20,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=376554.0, ans=0.025 2023-06-19 16:20:46,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.780e+02 3.176e+02 3.729e+02 7.377e+02, threshold=6.353e+02, percent-clipped=1.0 2023-06-19 16:21:34,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=376674.0, ans=0.125 2023-06-19 16:21:59,651 INFO [train.py:996] (0/4) Epoch 3, batch 1800, loss[loss=0.2489, simple_loss=0.3305, pruned_loss=0.08369, over 21403.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.331, pruned_loss=0.09686, over 4275645.13 frames. ], batch size: 194, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:23:12,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=376854.0, ans=0.0 2023-06-19 16:23:37,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=376914.0, ans=0.125 2023-06-19 16:23:51,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=376974.0, ans=0.125 2023-06-19 16:23:53,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=376974.0, ans=0.125 2023-06-19 16:24:08,972 INFO [train.py:996] (0/4) Epoch 3, batch 1850, loss[loss=0.2588, simple_loss=0.3494, pruned_loss=0.08407, over 21004.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3318, pruned_loss=0.09592, over 4274104.18 frames. ], batch size: 607, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:24:13,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=377034.0, ans=0.125 2023-06-19 16:24:21,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=377034.0, ans=0.0 2023-06-19 16:24:52,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.90 vs. limit=15.0 2023-06-19 16:25:27,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 2.637e+02 3.036e+02 3.803e+02 8.113e+02, threshold=6.071e+02, percent-clipped=1.0 2023-06-19 16:26:29,756 INFO [train.py:996] (0/4) Epoch 3, batch 1900, loss[loss=0.2361, simple_loss=0.3018, pruned_loss=0.08517, over 21647.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3323, pruned_loss=0.09601, over 4277596.67 frames. ], batch size: 247, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:26:53,806 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. 
limit=15.0 2023-06-19 16:27:37,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=377454.0, ans=0.125 2023-06-19 16:27:45,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=377514.0, ans=0.125 2023-06-19 16:28:23,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=377574.0, ans=0.2 2023-06-19 16:28:36,278 INFO [train.py:996] (0/4) Epoch 3, batch 1950, loss[loss=0.2177, simple_loss=0.2947, pruned_loss=0.07034, over 21489.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3293, pruned_loss=0.09603, over 4281629.06 frames. ], batch size: 212, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:29:11,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=377694.0, ans=0.125 2023-06-19 16:29:34,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=377694.0, ans=0.125 2023-06-19 16:29:56,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.787e+02 3.264e+02 3.675e+02 5.955e+02, threshold=6.529e+02, percent-clipped=0.0 2023-06-19 16:30:30,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=377874.0, ans=0.125 2023-06-19 16:30:51,318 INFO [train.py:996] (0/4) Epoch 3, batch 2000, loss[loss=0.2574, simple_loss=0.3151, pruned_loss=0.09985, over 20774.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3233, pruned_loss=0.09325, over 4272591.78 frames. ], batch size: 607, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:31:11,640 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.30 vs. limit=15.0 2023-06-19 16:31:21,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=377994.0, ans=0.125 2023-06-19 16:32:06,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=378054.0, ans=0.125 2023-06-19 16:32:28,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=378114.0, ans=0.125 2023-06-19 16:32:58,793 INFO [train.py:996] (0/4) Epoch 3, batch 2050, loss[loss=0.268, simple_loss=0.3229, pruned_loss=0.1066, over 21929.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3257, pruned_loss=0.09391, over 4282904.55 frames. ], batch size: 316, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:33:07,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=378234.0, ans=0.1 2023-06-19 16:34:07,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.758e+02 3.171e+02 3.876e+02 8.323e+02, threshold=6.343e+02, percent-clipped=2.0 2023-06-19 16:34:39,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=378474.0, ans=0.125 2023-06-19 16:34:54,472 INFO [train.py:996] (0/4) Epoch 3, batch 2100, loss[loss=0.2421, simple_loss=0.287, pruned_loss=0.0986, over 20243.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3296, pruned_loss=0.09683, over 4282253.98 frames. 
], batch size: 703, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:35:23,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=378594.0, ans=0.125 2023-06-19 16:35:56,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=378654.0, ans=0.125 2023-06-19 16:36:29,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=378714.0, ans=0.0 2023-06-19 16:37:17,378 INFO [train.py:996] (0/4) Epoch 3, batch 2150, loss[loss=0.2708, simple_loss=0.32, pruned_loss=0.1108, over 21611.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3285, pruned_loss=0.09802, over 4273516.48 frames. ], batch size: 415, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:37:23,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=378834.0, ans=0.125 2023-06-19 16:37:41,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=378894.0, ans=0.125 2023-06-19 16:38:23,709 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.947e+02 3.794e+02 4.834e+02 7.445e+02, threshold=7.587e+02, percent-clipped=4.0 2023-06-19 16:39:04,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=379074.0, ans=0.0 2023-06-19 16:39:08,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=379074.0, ans=0.125 2023-06-19 16:39:09,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=379074.0, ans=0.0 2023-06-19 16:39:18,309 INFO [train.py:996] (0/4) Epoch 3, batch 2200, loss[loss=0.3044, simple_loss=0.389, pruned_loss=0.1099, over 21606.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3344, pruned_loss=0.09876, over 4275450.16 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 16:39:33,399 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.77 vs. limit=15.0 2023-06-19 16:40:24,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=379254.0, ans=0.1 2023-06-19 16:41:00,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=379314.0, ans=0.125 2023-06-19 16:41:38,419 INFO [train.py:996] (0/4) Epoch 3, batch 2250, loss[loss=0.2779, simple_loss=0.3134, pruned_loss=0.1212, over 21406.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3307, pruned_loss=0.09647, over 4279180.74 frames. 
], batch size: 475, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:41:40,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=379434.0, ans=0.2 2023-06-19 16:42:09,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=379494.0, ans=0.125 2023-06-19 16:42:36,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=379554.0, ans=0.125 2023-06-19 16:42:47,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.563e+02 3.119e+02 3.980e+02 5.506e+02, threshold=6.238e+02, percent-clipped=0.0 2023-06-19 16:43:06,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=379614.0, ans=0.0 2023-06-19 16:43:07,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=379614.0, ans=0.035 2023-06-19 16:43:46,264 INFO [train.py:996] (0/4) Epoch 3, batch 2300, loss[loss=0.2367, simple_loss=0.3003, pruned_loss=0.08651, over 21452.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3258, pruned_loss=0.09498, over 4282741.79 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:43:54,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=379734.0, ans=0.2 2023-06-19 16:44:04,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=379734.0, ans=0.125 2023-06-19 16:44:25,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-19 16:45:38,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=379974.0, ans=0.0 2023-06-19 16:45:41,091 INFO [train.py:996] (0/4) Epoch 3, batch 2350, loss[loss=0.2518, simple_loss=0.3102, pruned_loss=0.09666, over 21839.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3218, pruned_loss=0.09533, over 4284242.07 frames. ], batch size: 107, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 16:46:58,126 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.669e+02 3.076e+02 3.679e+02 5.519e+02, threshold=6.152e+02, percent-clipped=0.0 2023-06-19 16:47:34,108 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-06-19 16:48:11,839 INFO [train.py:996] (0/4) Epoch 3, batch 2400, loss[loss=0.2958, simple_loss=0.3505, pruned_loss=0.1205, over 21804.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3256, pruned_loss=0.09764, over 4279658.32 frames. ], batch size: 247, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:50:04,678 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.05 vs. limit=15.0 2023-06-19 16:50:08,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=380574.0, ans=0.125 2023-06-19 16:50:33,696 INFO [train.py:996] (0/4) Epoch 3, batch 2450, loss[loss=0.3089, simple_loss=0.3693, pruned_loss=0.1242, over 21296.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3322, pruned_loss=0.1006, over 4276983.03 frames. 
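
grad_scale in the batch reports is the dynamic fp16 loss scale: it sits at 32.0 through most of this stretch, drops to 16.0 around batches 1500-1550 and again around 2250-2350 after overflowing gradients, and grows back to 32.0 once steps stay clean. That halve-on-overflow, grow-after-clean-steps pattern matches standard torch.cuda.amp dynamic scaling, sketched here with illustrative settings rather than this run's actual ones:

```python
import torch

# Illustrative settings; defaults shown are PyTorch's, not this recipe's.
scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

def train_step(model, optimizer, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)            # fp16 forward; model returns a loss
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # skips the step on inf/nan gradients
    scaler.update()                    # halves the scale after an overflow,
                                       # doubles it every growth_interval
                                       # consecutive clean steps
    return scaler.get_scale()          # the value logged as grad_scale
```
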
], batch size: 143, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:50:56,029 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:50:58,830 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:51:32,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.697e+02 3.047e+02 3.544e+02 7.014e+02, threshold=6.094e+02, percent-clipped=1.0 2023-06-19 16:51:35,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=380814.0, ans=0.125 2023-06-19 16:51:36,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=380814.0, ans=0.035 2023-06-19 16:51:57,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-19 16:52:22,741 INFO [train.py:996] (0/4) Epoch 3, batch 2500, loss[loss=0.267, simple_loss=0.3256, pruned_loss=0.1042, over 21513.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3299, pruned_loss=0.09901, over 4262898.82 frames. ], batch size: 414, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:54:19,209 INFO [train.py:996] (0/4) Epoch 3, batch 2550, loss[loss=0.243, simple_loss=0.3016, pruned_loss=0.09223, over 21137.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3265, pruned_loss=0.09674, over 4265735.49 frames. ], batch size: 159, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:54:27,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-19 16:54:56,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=381294.0, ans=0.2 2023-06-19 16:55:07,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=381294.0, ans=0.09899494936611666 2023-06-19 16:55:17,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=381354.0, ans=0.035 2023-06-19 16:55:33,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.648e+02 3.147e+02 3.811e+02 6.835e+02, threshold=6.294e+02, percent-clipped=1.0 2023-06-19 16:56:01,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=381474.0, ans=0.125 2023-06-19 16:56:20,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5 2023-06-19 16:56:34,138 INFO [train.py:996] (0/4) Epoch 3, batch 2600, loss[loss=0.3018, simple_loss=0.3709, pruned_loss=0.1163, over 16794.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3296, pruned_loss=0.0998, over 4264180.15 frames. ], batch size: 60, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:57:30,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=381654.0, ans=0.04949747468305833 2023-06-19 16:57:44,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.58 vs. 
limit=15.0 2023-06-19 16:58:59,104 INFO [train.py:996] (0/4) Epoch 3, batch 2650, loss[loss=0.235, simple_loss=0.295, pruned_loss=0.08756, over 20821.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3302, pruned_loss=0.101, over 4274074.79 frames. ], batch size: 608, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 16:59:42,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=381894.0, ans=0.015 2023-06-19 16:59:55,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=381954.0, ans=0.0 2023-06-19 17:00:02,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.896e+02 3.459e+02 3.981e+02 6.985e+02, threshold=6.919e+02, percent-clipped=4.0 2023-06-19 17:00:05,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.74 vs. limit=6.0 2023-06-19 17:00:05,309 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.11 vs. limit=15.0 2023-06-19 17:01:09,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=382074.0, ans=0.125 2023-06-19 17:01:11,824 INFO [train.py:996] (0/4) Epoch 3, batch 2700, loss[loss=0.302, simple_loss=0.3629, pruned_loss=0.1206, over 21585.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3276, pruned_loss=0.09837, over 4268816.80 frames. ], batch size: 473, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:01:15,713 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-19 17:01:19,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=382134.0, ans=0.125 2023-06-19 17:02:23,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-19 17:03:20,416 INFO [train.py:996] (0/4) Epoch 3, batch 2750, loss[loss=0.2684, simple_loss=0.3546, pruned_loss=0.09109, over 21832.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.326, pruned_loss=0.09825, over 4276719.93 frames. 
], batch size: 351, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:03:36,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=382434.0, ans=0.0 2023-06-19 17:03:38,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=382434.0, ans=0.0 2023-06-19 17:04:00,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=382494.0, ans=0.0 2023-06-19 17:04:09,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=382554.0, ans=0.1 2023-06-19 17:04:21,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=382554.0, ans=0.125 2023-06-19 17:04:33,263 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.812e+02 3.458e+02 3.888e+02 7.269e+02, threshold=6.916e+02, percent-clipped=2.0 2023-06-19 17:05:02,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=382614.0, ans=0.125 2023-06-19 17:05:26,913 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.94 vs. limit=5.0 2023-06-19 17:05:43,916 INFO [train.py:996] (0/4) Epoch 3, batch 2800, loss[loss=0.3259, simple_loss=0.3994, pruned_loss=0.1262, over 21655.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.332, pruned_loss=0.1007, over 4279197.22 frames. ], batch size: 389, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:07:30,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=382974.0, ans=0.125 2023-06-19 17:07:47,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=382974.0, ans=0.0 2023-06-19 17:07:51,277 INFO [train.py:996] (0/4) Epoch 3, batch 2850, loss[loss=0.2312, simple_loss=0.295, pruned_loss=0.0837, over 21629.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3317, pruned_loss=0.1012, over 4278614.31 frames. ], batch size: 263, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:08:02,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-19 17:08:57,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.123e+02 2.952e+02 3.438e+02 4.041e+02 6.558e+02, threshold=6.876e+02, percent-clipped=0.0 2023-06-19 17:09:20,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=383214.0, ans=0.05 2023-06-19 17:09:46,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=383274.0, ans=0.1 2023-06-19 17:09:48,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-19 17:09:56,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-19 17:10:03,354 INFO [train.py:996] (0/4) Epoch 3, batch 2900, loss[loss=0.2194, simple_loss=0.2647, pruned_loss=0.08705, over 20787.00 frames. 
], tot_loss[loss=0.2642, simple_loss=0.3277, pruned_loss=0.1004, over 4284435.10 frames. ], batch size: 608, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:10:03,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383334.0, ans=0.1 2023-06-19 17:12:17,737 INFO [train.py:996] (0/4) Epoch 3, batch 2950, loss[loss=0.3438, simple_loss=0.4103, pruned_loss=0.1387, over 21594.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3294, pruned_loss=0.1006, over 4288563.99 frames. ], batch size: 508, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:13:27,106 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.666e+02 3.329e+02 3.985e+02 6.298e+02, threshold=6.658e+02, percent-clipped=0.0 2023-06-19 17:13:52,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=383814.0, ans=0.0 2023-06-19 17:14:27,047 INFO [train.py:996] (0/4) Epoch 3, batch 3000, loss[loss=0.3112, simple_loss=0.3722, pruned_loss=0.1251, over 21574.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3325, pruned_loss=0.1016, over 4289532.28 frames. ], batch size: 414, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:14:27,049 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 17:15:26,749 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2641, simple_loss=0.3582, pruned_loss=0.08497, over 1796401.00 frames. 2023-06-19 17:15:26,751 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-19 17:15:42,408 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-64000.pt 2023-06-19 17:16:17,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=384054.0, ans=0.0 2023-06-19 17:16:38,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=384114.0, ans=0.1 2023-06-19 17:17:07,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=384114.0, ans=0.0 2023-06-19 17:17:32,336 INFO [train.py:996] (0/4) Epoch 3, batch 3050, loss[loss=0.2538, simple_loss=0.3345, pruned_loss=0.08653, over 21672.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3334, pruned_loss=0.09998, over 4291991.19 frames. ], batch size: 414, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:17:38,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=384234.0, ans=0.95 2023-06-19 17:17:55,852 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-19 17:18:35,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=384354.0, ans=0.125 2023-06-19 17:18:36,309 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 2.612e+02 3.068e+02 3.884e+02 6.954e+02, threshold=6.136e+02, percent-clipped=1.0 2023-06-19 17:18:44,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=384414.0, ans=0.0 2023-06-19 17:19:25,762 INFO [train.py:996] (0/4) Epoch 3, batch 3100, loss[loss=0.2411, simple_loss=0.3245, pruned_loss=0.07883, over 21678.00 frames. 
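
Interleaved with training, the loop periodically recomputes the dev-set loss (always over the same 1796401.00 frames, which keeps the validation numbers comparable across the run) and writes a checkpoint keyed by the global batch index, here zipformer/exp_L_small/checkpoint-64000.pt. A schematic of that bookkeeping; the names are illustrative and compute_validation_loss is an assumed helper:

```python
import logging
import torch

def maybe_validate_and_save(batch_idx_train, model, valid_dl, params):
    # Periodic validation: the same dev set every time, so the logged
    # "validation: loss=..." values are directly comparable.
    if batch_idx_train % params.valid_interval == 0:
        model.eval()
        with torch.no_grad():
            valid_loss = compute_validation_loss(model, valid_dl)  # assumed helper
        logging.info(f"validation: loss={valid_loss:.4f}")
        model.train()
    # Periodic checkpoints named by global batch index, as with
    # checkpoint-64000.pt above.
    if batch_idx_train % params.save_every_n == 0 and batch_idx_train > 0:
        path = params.exp_dir / f"checkpoint-{batch_idx_train}.pt"
        torch.save({"model": model.state_dict()}, path)
        logging.info(f"Saving checkpoint to {path}")
```
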
], tot_loss[loss=0.2643, simple_loss=0.3323, pruned_loss=0.09814, over 4288546.95 frames. ], batch size: 298, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:20:16,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=384654.0, ans=0.125 2023-06-19 17:20:23,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=384654.0, ans=0.125 2023-06-19 17:20:42,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=384714.0, ans=0.125 2023-06-19 17:21:18,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=384774.0, ans=0.125 2023-06-19 17:21:23,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=384774.0, ans=0.1 2023-06-19 17:21:32,977 INFO [train.py:996] (0/4) Epoch 3, batch 3150, loss[loss=0.2743, simple_loss=0.342, pruned_loss=0.1033, over 21522.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3378, pruned_loss=0.1007, over 4292172.85 frames. ], batch size: 230, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:22:53,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=384954.0, ans=0.125 2023-06-19 17:22:54,647 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.666e+02 3.183e+02 4.146e+02 7.472e+02, threshold=6.366e+02, percent-clipped=3.0 2023-06-19 17:23:05,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=385014.0, ans=0.0 2023-06-19 17:23:21,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-19 17:23:30,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=385074.0, ans=0.0 2023-06-19 17:24:00,456 INFO [train.py:996] (0/4) Epoch 3, batch 3200, loss[loss=0.2548, simple_loss=0.3317, pruned_loss=0.08896, over 21926.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3379, pruned_loss=0.1004, over 4285518.79 frames. ], batch size: 317, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:25:31,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=385314.0, ans=0.125 2023-06-19 17:25:54,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=385374.0, ans=0.125 2023-06-19 17:26:04,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=385374.0, ans=0.125 2023-06-19 17:26:14,500 INFO [train.py:996] (0/4) Epoch 3, batch 3250, loss[loss=0.2412, simple_loss=0.2985, pruned_loss=0.09197, over 21649.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3377, pruned_loss=0.1022, over 4289153.24 frames. 
], batch size: 247, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:27:15,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.186e+02 3.700e+02 4.457e+02 5.967e+02, threshold=7.400e+02, percent-clipped=0.0 2023-06-19 17:27:36,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=385614.0, ans=0.125 2023-06-19 17:28:25,314 INFO [train.py:996] (0/4) Epoch 3, batch 3300, loss[loss=0.2586, simple_loss=0.3474, pruned_loss=0.08485, over 20845.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3348, pruned_loss=0.1012, over 4287020.13 frames. ], batch size: 608, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:29:52,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=385914.0, ans=0.125 2023-06-19 17:30:48,662 INFO [train.py:996] (0/4) Epoch 3, batch 3350, loss[loss=0.2571, simple_loss=0.3148, pruned_loss=0.09967, over 21818.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3362, pruned_loss=0.1004, over 4284984.27 frames. ], batch size: 247, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:32:02,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.950e+02 3.389e+02 4.289e+02 7.899e+02, threshold=6.778e+02, percent-clipped=1.0 2023-06-19 17:32:45,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-19 17:33:05,260 INFO [train.py:996] (0/4) Epoch 3, batch 3400, loss[loss=0.2524, simple_loss=0.3161, pruned_loss=0.09436, over 21672.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3353, pruned_loss=0.1007, over 4289762.12 frames. ], batch size: 298, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:33:18,584 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.44 vs. limit=8.0 2023-06-19 17:33:28,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=386394.0, ans=0.0 2023-06-19 17:33:57,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=386514.0, ans=0.04949747468305833 2023-06-19 17:34:20,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=386514.0, ans=0.125 2023-06-19 17:34:54,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=386574.0, ans=0.125 2023-06-19 17:35:07,128 INFO [train.py:996] (0/4) Epoch 3, batch 3450, loss[loss=0.2628, simple_loss=0.3124, pruned_loss=0.1066, over 21827.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.333, pruned_loss=0.1004, over 4286071.61 frames. ], batch size: 372, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 17:35:28,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=386694.0, ans=0.125 2023-06-19 17:36:23,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.906e+02 3.291e+02 4.087e+02 6.835e+02, threshold=6.581e+02, percent-clipped=1.0 2023-06-19 17:37:01,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.06 vs. 
limit=10.0 2023-06-19 17:37:03,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=386874.0, ans=0.0 2023-06-19 17:37:11,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=386874.0, ans=0.0 2023-06-19 17:37:15,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=386934.0, ans=0.125 2023-06-19 17:37:16,556 INFO [train.py:996] (0/4) Epoch 3, batch 3500, loss[loss=0.2761, simple_loss=0.3459, pruned_loss=0.1031, over 21798.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3421, pruned_loss=0.1047, over 4287662.68 frames. ], batch size: 118, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:38:14,293 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-19 17:39:34,542 INFO [train.py:996] (0/4) Epoch 3, batch 3550, loss[loss=0.2332, simple_loss=0.2941, pruned_loss=0.08609, over 21380.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3423, pruned_loss=0.1052, over 4286975.32 frames. ], batch size: 211, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:40:10,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=387294.0, ans=0.0 2023-06-19 17:40:51,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 2.929e+02 3.306e+02 4.196e+02 6.685e+02, threshold=6.611e+02, percent-clipped=1.0 2023-06-19 17:41:52,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=387534.0, ans=0.125 2023-06-19 17:41:53,749 INFO [train.py:996] (0/4) Epoch 3, batch 3600, loss[loss=0.2312, simple_loss=0.3025, pruned_loss=0.07994, over 21235.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3382, pruned_loss=0.105, over 4284756.43 frames. ], batch size: 549, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:42:03,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=387534.0, ans=0.0 2023-06-19 17:43:41,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=387714.0, ans=0.0 2023-06-19 17:43:57,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-19 17:44:17,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.84 vs. limit=10.0 2023-06-19 17:44:19,134 INFO [train.py:996] (0/4) Epoch 3, batch 3650, loss[loss=0.3152, simple_loss=0.388, pruned_loss=0.1213, over 21555.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3411, pruned_loss=0.1061, over 4278222.56 frames. ], batch size: 508, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:44:20,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=387834.0, ans=0.125 2023-06-19 17:45:06,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=387894.0, ans=0.0 2023-06-19 17:45:13,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.29 vs. 
limit=15.0 2023-06-19 17:45:31,523 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.846e+02 3.301e+02 4.049e+02 6.625e+02, threshold=6.601e+02, percent-clipped=2.0 2023-06-19 17:45:48,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.54 vs. limit=15.0 2023-06-19 17:46:25,667 INFO [train.py:996] (0/4) Epoch 3, batch 3700, loss[loss=0.2828, simple_loss=0.3563, pruned_loss=0.1047, over 21427.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3392, pruned_loss=0.1043, over 4283187.04 frames. ], batch size: 548, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:47:53,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=388314.0, ans=22.5 2023-06-19 17:47:54,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=388314.0, ans=0.125 2023-06-19 17:47:59,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=388314.0, ans=0.125 2023-06-19 17:48:00,998 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-19 17:48:56,822 INFO [train.py:996] (0/4) Epoch 3, batch 3750, loss[loss=0.367, simple_loss=0.4133, pruned_loss=0.1603, over 21724.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3364, pruned_loss=0.1029, over 4286728.04 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:50:00,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=388554.0, ans=0.2 2023-06-19 17:50:07,715 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 3.161e+02 3.639e+02 4.340e+02 7.555e+02, threshold=7.277e+02, percent-clipped=2.0 2023-06-19 17:50:25,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=388614.0, ans=0.035 2023-06-19 17:50:56,786 INFO [train.py:996] (0/4) Epoch 3, batch 3800, loss[loss=0.3045, simple_loss=0.3593, pruned_loss=0.1248, over 21756.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.336, pruned_loss=0.1028, over 4286816.37 frames. ], batch size: 392, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:51:41,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.56 vs. limit=15.0 2023-06-19 17:52:48,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=388974.0, ans=0.0 2023-06-19 17:53:02,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=12.0 2023-06-19 17:53:09,364 INFO [train.py:996] (0/4) Epoch 3, batch 3850, loss[loss=0.2291, simple_loss=0.2844, pruned_loss=0.08696, over 21696.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3318, pruned_loss=0.1024, over 4282414.02 frames. 
], batch size: 124, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:53:11,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389034.0, ans=0.1 2023-06-19 17:53:12,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=389034.0, ans=0.125 2023-06-19 17:53:14,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389034.0, ans=0.1 2023-06-19 17:53:36,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=389094.0, ans=0.125 2023-06-19 17:54:10,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=389154.0, ans=0.125 2023-06-19 17:54:11,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=389154.0, ans=0.2 2023-06-19 17:54:12,933 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.931e+02 3.555e+02 4.316e+02 7.141e+02, threshold=7.110e+02, percent-clipped=0.0 2023-06-19 17:54:35,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=389214.0, ans=0.125 2023-06-19 17:54:47,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=389214.0, ans=0.035 2023-06-19 17:55:21,648 INFO [train.py:996] (0/4) Epoch 3, batch 3900, loss[loss=0.3164, simple_loss=0.3478, pruned_loss=0.1425, over 21766.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3276, pruned_loss=0.1012, over 4278813.21 frames. ], batch size: 508, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:55:48,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-19 17:56:42,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=389514.0, ans=0.125 2023-06-19 17:56:43,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=389514.0, ans=0.2 2023-06-19 17:57:00,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389514.0, ans=0.1 2023-06-19 17:57:08,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=389574.0, ans=0.0 2023-06-19 17:57:32,063 INFO [train.py:996] (0/4) Epoch 3, batch 3950, loss[loss=0.1861, simple_loss=0.2702, pruned_loss=0.05105, over 21559.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3265, pruned_loss=0.09917, over 4273867.73 frames. 
], batch size: 230, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 17:57:55,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=389634.0, ans=0.2 2023-06-19 17:58:36,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=389754.0, ans=0.125 2023-06-19 17:58:47,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.533e+02 2.923e+02 3.801e+02 7.027e+02, threshold=5.846e+02, percent-clipped=0.0 2023-06-19 17:58:52,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=389814.0, ans=0.0 2023-06-19 17:59:08,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=389814.0, ans=0.125 2023-06-19 17:59:41,386 INFO [train.py:996] (0/4) Epoch 3, batch 4000, loss[loss=0.2185, simple_loss=0.2752, pruned_loss=0.08087, over 21206.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3187, pruned_loss=0.09531, over 4276442.17 frames. ], batch size: 159, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:01:06,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=390054.0, ans=0.0 2023-06-19 18:01:11,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=390114.0, ans=0.2 2023-06-19 18:01:28,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390174.0, ans=0.1 2023-06-19 18:01:33,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=390174.0, ans=0.025 2023-06-19 18:01:34,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=390174.0, ans=0.125 2023-06-19 18:01:35,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=390174.0, ans=0.125 2023-06-19 18:01:44,574 INFO [train.py:996] (0/4) Epoch 3, batch 4050, loss[loss=0.2075, simple_loss=0.2885, pruned_loss=0.06329, over 21417.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3191, pruned_loss=0.09338, over 4276604.30 frames. ], batch size: 194, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:02:43,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=390294.0, ans=0.125 2023-06-19 18:03:01,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=390354.0, ans=0.125 2023-06-19 18:03:07,895 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.511e+02 2.852e+02 3.605e+02 5.912e+02, threshold=5.705e+02, percent-clipped=1.0 2023-06-19 18:03:52,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=390474.0, ans=0.2 2023-06-19 18:03:59,019 INFO [train.py:996] (0/4) Epoch 3, batch 4100, loss[loss=0.2657, simple_loss=0.3639, pruned_loss=0.08376, over 19914.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3221, pruned_loss=0.0946, over 4286174.71 frames. 
], batch size: 703, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:04:15,139 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:04:43,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=390594.0, ans=0.2 2023-06-19 18:04:44,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=390594.0, ans=0.2 2023-06-19 18:04:49,546 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-06-19 18:05:37,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=390714.0, ans=0.125 2023-06-19 18:05:54,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-19 18:06:25,553 INFO [train.py:996] (0/4) Epoch 3, batch 4150, loss[loss=0.2993, simple_loss=0.3508, pruned_loss=0.1239, over 21467.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3225, pruned_loss=0.0919, over 4275849.78 frames. ], batch size: 509, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:06:25,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=390834.0, ans=0.125 2023-06-19 18:06:39,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=22.5 2023-06-19 18:07:18,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-19 18:07:30,120 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 2.414e+02 2.926e+02 3.479e+02 5.759e+02, threshold=5.851e+02, percent-clipped=1.0 2023-06-19 18:08:37,044 INFO [train.py:996] (0/4) Epoch 3, batch 4200, loss[loss=0.3756, simple_loss=0.4533, pruned_loss=0.1489, over 21511.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3234, pruned_loss=0.09244, over 4274342.97 frames. ], batch size: 471, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:09:16,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=391194.0, ans=0.0 2023-06-19 18:10:19,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=391314.0, ans=0.2 2023-06-19 18:10:25,554 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:10:51,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-19 18:10:59,595 INFO [train.py:996] (0/4) Epoch 3, batch 4250, loss[loss=0.3104, simple_loss=0.3912, pruned_loss=0.1148, over 21595.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3282, pruned_loss=0.09379, over 4270393.91 frames. 
], batch size: 414, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:11:00,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=391434.0, ans=0.125 2023-06-19 18:11:04,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=391434.0, ans=0.125 2023-06-19 18:11:09,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=391434.0, ans=0.2 2023-06-19 18:11:10,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=391434.0, ans=0.0 2023-06-19 18:11:11,478 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=22.5 2023-06-19 18:11:22,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=391494.0, ans=0.0 2023-06-19 18:11:32,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-19 18:12:15,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 3.056e+02 3.919e+02 5.866e+02 1.121e+03, threshold=7.838e+02, percent-clipped=25.0 2023-06-19 18:12:35,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=391614.0, ans=0.125 2023-06-19 18:13:07,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=391674.0, ans=0.2 2023-06-19 18:13:09,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=391674.0, ans=0.2 2023-06-19 18:13:13,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=391674.0, ans=0.2 2023-06-19 18:13:22,653 INFO [train.py:996] (0/4) Epoch 3, batch 4300, loss[loss=0.3279, simple_loss=0.4182, pruned_loss=0.1188, over 21628.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3337, pruned_loss=0.09659, over 4267448.69 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:14:05,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=391794.0, ans=0.125 2023-06-19 18:14:08,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=391794.0, ans=0.125 2023-06-19 18:14:12,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=391854.0, ans=0.2 2023-06-19 18:14:32,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-19 18:14:51,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=391914.0, ans=0.125 2023-06-19 18:15:34,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.05 vs. 
limit=22.5 2023-06-19 18:15:38,843 INFO [train.py:996] (0/4) Epoch 3, batch 4350, loss[loss=0.2228, simple_loss=0.2817, pruned_loss=0.0819, over 21504.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3319, pruned_loss=0.09599, over 4268410.06 frames. ], batch size: 212, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:15:39,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=392034.0, ans=0.125 2023-06-19 18:16:22,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392094.0, ans=0.1 2023-06-19 18:16:56,980 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.727e+02 3.088e+02 4.174e+02 7.759e+02, threshold=6.176e+02, percent-clipped=0.0 2023-06-19 18:17:33,447 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-19 18:17:39,790 INFO [train.py:996] (0/4) Epoch 3, batch 4400, loss[loss=0.2779, simple_loss=0.3314, pruned_loss=0.1122, over 21489.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3288, pruned_loss=0.09547, over 4270744.47 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:19:57,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=392574.0, ans=0.125 2023-06-19 18:20:02,749 INFO [train.py:996] (0/4) Epoch 3, batch 4450, loss[loss=0.3019, simple_loss=0.3842, pruned_loss=0.1098, over 21654.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3377, pruned_loss=0.09711, over 4275453.70 frames. ], batch size: 247, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:20:08,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.51 vs. limit=15.0 2023-06-19 18:20:35,197 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:20:36,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=392694.0, ans=0.05 2023-06-19 18:20:39,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=392694.0, ans=0.125 2023-06-19 18:21:17,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392754.0, ans=0.1 2023-06-19 18:21:24,213 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 2.794e+02 3.143e+02 3.836e+02 6.823e+02, threshold=6.286e+02, percent-clipped=3.0 2023-06-19 18:21:49,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=392874.0, ans=15.0 2023-06-19 18:21:53,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=392874.0, ans=0.015 2023-06-19 18:22:00,531 INFO [train.py:996] (0/4) Epoch 3, batch 4500, loss[loss=0.2529, simple_loss=0.3211, pruned_loss=0.09235, over 21243.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3388, pruned_loss=0.09918, over 4274820.36 frames. 
], batch size: 176, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:22:22,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=392934.0, ans=0.02 2023-06-19 18:22:24,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-19 18:23:12,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=393054.0, ans=0.0 2023-06-19 18:23:23,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=393054.0, ans=0.0 2023-06-19 18:23:35,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=393114.0, ans=0.125 2023-06-19 18:23:38,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=393114.0, ans=0.2 2023-06-19 18:24:34,827 INFO [train.py:996] (0/4) Epoch 3, batch 4550, loss[loss=0.3471, simple_loss=0.4013, pruned_loss=0.1465, over 21433.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3445, pruned_loss=0.1004, over 4272016.43 frames. ], batch size: 471, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 18:24:56,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=393234.0, ans=0.0 2023-06-19 18:25:17,800 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2023-06-19 18:25:53,414 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.924e+02 3.565e+02 4.370e+02 6.839e+02, threshold=7.130e+02, percent-clipped=5.0 2023-06-19 18:26:01,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=393414.0, ans=0.1 2023-06-19 18:26:57,124 INFO [train.py:996] (0/4) Epoch 3, batch 4600, loss[loss=0.3042, simple_loss=0.4022, pruned_loss=0.1031, over 21196.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3462, pruned_loss=0.1017, over 4276554.21 frames. ], batch size: 548, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:27:02,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=393534.0, ans=0.0 2023-06-19 18:27:38,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-19 18:27:49,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=393654.0, ans=0.125 2023-06-19 18:28:53,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-19 18:29:14,433 INFO [train.py:996] (0/4) Epoch 3, batch 4650, loss[loss=0.2192, simple_loss=0.2908, pruned_loss=0.07386, over 21278.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3393, pruned_loss=0.1001, over 4281552.95 frames. 
], batch size: 143, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:29:16,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=393834.0, ans=0.0 2023-06-19 18:29:40,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=393894.0, ans=0.125 2023-06-19 18:29:45,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=393894.0, ans=0.125 2023-06-19 18:30:08,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=393954.0, ans=0.035 2023-06-19 18:30:21,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.410e+02 2.813e+02 3.483e+02 6.632e+02, threshold=5.627e+02, percent-clipped=0.0 2023-06-19 18:30:50,896 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-19 18:31:02,463 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0 2023-06-19 18:31:17,313 INFO [train.py:996] (0/4) Epoch 3, batch 4700, loss[loss=0.2197, simple_loss=0.2774, pruned_loss=0.08096, over 21471.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.329, pruned_loss=0.09693, over 4279608.89 frames. ], batch size: 212, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:31:36,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=394134.0, ans=0.2 2023-06-19 18:32:27,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394254.0, ans=0.1 2023-06-19 18:33:35,279 INFO [train.py:996] (0/4) Epoch 3, batch 4750, loss[loss=0.234, simple_loss=0.2961, pruned_loss=0.08599, over 21678.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3236, pruned_loss=0.09636, over 4282514.60 frames. ], batch size: 230, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:33:52,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=394434.0, ans=0.125 2023-06-19 18:34:05,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=394494.0, ans=0.125 2023-06-19 18:34:08,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=394494.0, ans=0.1 2023-06-19 18:34:36,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=394554.0, ans=0.05 2023-06-19 18:34:40,261 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.729e+02 3.140e+02 4.884e+02 6.960e+02, threshold=6.280e+02, percent-clipped=14.0 2023-06-19 18:35:53,987 INFO [train.py:996] (0/4) Epoch 3, batch 4800, loss[loss=0.2667, simple_loss=0.3515, pruned_loss=0.09099, over 21800.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3246, pruned_loss=0.0971, over 4290275.45 frames. 
], batch size: 371, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:36:18,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=394794.0, ans=0.125 2023-06-19 18:36:54,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=394914.0, ans=0.125 2023-06-19 18:37:50,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=394974.0, ans=0.125 2023-06-19 18:37:56,649 INFO [train.py:996] (0/4) Epoch 3, batch 4850, loss[loss=0.2454, simple_loss=0.309, pruned_loss=0.0909, over 21878.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3221, pruned_loss=0.09631, over 4291827.25 frames. ], batch size: 118, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:38:26,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=395094.0, ans=0.125 2023-06-19 18:38:30,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=395094.0, ans=0.125 2023-06-19 18:38:42,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=395154.0, ans=0.125 2023-06-19 18:38:57,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=395154.0, ans=0.1 2023-06-19 18:39:00,886 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 2.909e+02 3.496e+02 4.405e+02 5.702e+02, threshold=6.991e+02, percent-clipped=0.0 2023-06-19 18:40:00,488 INFO [train.py:996] (0/4) Epoch 3, batch 4900, loss[loss=0.2709, simple_loss=0.3443, pruned_loss=0.09879, over 21591.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3237, pruned_loss=0.09681, over 4282453.83 frames. ], batch size: 230, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:40:22,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-19 18:40:52,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=395394.0, ans=0.125 2023-06-19 18:41:11,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=395514.0, ans=0.05 2023-06-19 18:41:35,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=395514.0, ans=0.025 2023-06-19 18:42:09,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=395574.0, ans=0.125 2023-06-19 18:42:21,345 INFO [train.py:996] (0/4) Epoch 3, batch 4950, loss[loss=0.2012, simple_loss=0.2847, pruned_loss=0.05886, over 21445.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3264, pruned_loss=0.09453, over 4288176.45 frames. 
], batch size: 194, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:42:49,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=395694.0, ans=0.125 2023-06-19 18:43:27,981 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:43:36,183 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.438e+02 2.884e+02 3.463e+02 6.412e+02, threshold=5.767e+02, percent-clipped=0.0 2023-06-19 18:44:02,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-06-19 18:44:06,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=395874.0, ans=0.125 2023-06-19 18:44:30,721 INFO [train.py:996] (0/4) Epoch 3, batch 5000, loss[loss=0.2605, simple_loss=0.3278, pruned_loss=0.09663, over 21892.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3255, pruned_loss=0.09109, over 4293517.66 frames. ], batch size: 316, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:45:19,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-19 18:45:49,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=396114.0, ans=0.04949747468305833 2023-06-19 18:46:06,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=396174.0, ans=0.2 2023-06-19 18:46:30,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.86 vs. limit=15.0 2023-06-19 18:46:39,105 INFO [train.py:996] (0/4) Epoch 3, batch 5050, loss[loss=0.2657, simple_loss=0.3492, pruned_loss=0.09115, over 21632.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.327, pruned_loss=0.09331, over 4298810.44 frames. ], batch size: 441, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:46:41,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=396234.0, ans=0.125 2023-06-19 18:46:55,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=396234.0, ans=0.125 2023-06-19 18:47:01,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396294.0, ans=0.1 2023-06-19 18:47:46,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.997e+02 3.476e+02 4.224e+02 8.088e+02, threshold=6.952e+02, percent-clipped=5.0 2023-06-19 18:48:23,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=396414.0, ans=0.05 2023-06-19 18:48:55,176 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:49:06,028 INFO [train.py:996] (0/4) Epoch 3, batch 5100, loss[loss=0.26, simple_loss=0.3253, pruned_loss=0.0973, over 21721.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3272, pruned_loss=0.09446, over 4302014.93 frames. 
], batch size: 389, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:49:08,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=396534.0, ans=0.125 2023-06-19 18:49:59,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-19 18:51:07,625 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.78 vs. limit=22.5 2023-06-19 18:51:09,500 INFO [train.py:996] (0/4) Epoch 3, batch 5150, loss[loss=0.2655, simple_loss=0.3228, pruned_loss=0.1041, over 21768.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3264, pruned_loss=0.09599, over 4302787.04 frames. ], batch size: 441, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:51:40,409 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-19 18:51:41,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-19 18:51:47,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=396894.0, ans=0.125 2023-06-19 18:52:00,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=396954.0, ans=0.125 2023-06-19 18:52:27,684 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.779e+02 3.148e+02 3.948e+02 7.558e+02, threshold=6.295e+02, percent-clipped=3.0 2023-06-19 18:52:44,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=397074.0, ans=0.125 2023-06-19 18:53:24,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=397074.0, ans=0.125 2023-06-19 18:53:36,798 INFO [train.py:996] (0/4) Epoch 3, batch 5200, loss[loss=0.2568, simple_loss=0.3425, pruned_loss=0.08556, over 21766.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3281, pruned_loss=0.09732, over 4300939.89 frames. ], batch size: 247, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:53:37,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=397134.0, ans=0.05 2023-06-19 18:53:45,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=397134.0, ans=0.2 2023-06-19 18:54:04,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=397194.0, ans=0.5 2023-06-19 18:54:13,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.71 vs. limit=22.5 2023-06-19 18:55:30,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=397374.0, ans=0.125 2023-06-19 18:55:53,388 INFO [train.py:996] (0/4) Epoch 3, batch 5250, loss[loss=0.2234, simple_loss=0.2934, pruned_loss=0.0767, over 21761.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3306, pruned_loss=0.0957, over 4297317.08 frames. 
], batch size: 112, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 18:56:20,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=397494.0, ans=0.125 2023-06-19 18:56:40,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=397554.0, ans=0.125 2023-06-19 18:56:53,726 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.637e+02 3.155e+02 4.003e+02 7.471e+02, threshold=6.309e+02, percent-clipped=2.0 2023-06-19 18:57:38,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=397674.0, ans=0.0 2023-06-19 18:57:47,862 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:57:56,747 INFO [train.py:996] (0/4) Epoch 3, batch 5300, loss[loss=0.2348, simple_loss=0.3067, pruned_loss=0.08138, over 21592.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3305, pruned_loss=0.09558, over 4296755.94 frames. ], batch size: 263, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 18:57:58,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=397734.0, ans=0.125 2023-06-19 18:58:26,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=397794.0, ans=0.95 2023-06-19 18:58:44,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=397794.0, ans=0.125 2023-06-19 18:59:00,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-19 18:59:57,891 INFO [train.py:996] (0/4) Epoch 3, batch 5350, loss[loss=0.3278, simple_loss=0.452, pruned_loss=0.1018, over 19774.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3297, pruned_loss=0.09701, over 4300979.35 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:00:25,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=398094.0, ans=0.125 2023-06-19 19:00:39,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=398094.0, ans=0.05 2023-06-19 19:01:15,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.703e+02 3.245e+02 4.023e+02 6.387e+02, threshold=6.490e+02, percent-clipped=1.0 2023-06-19 19:01:50,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=398214.0, ans=0.0 2023-06-19 19:02:24,843 INFO [train.py:996] (0/4) Epoch 3, batch 5400, loss[loss=0.2807, simple_loss=0.3389, pruned_loss=0.1113, over 19985.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.33, pruned_loss=0.09834, over 4295702.12 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:03:16,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=398454.0, ans=10.0 2023-06-19 19:04:42,267 INFO [train.py:996] (0/4) Epoch 3, batch 5450, loss[loss=0.2276, simple_loss=0.3114, pruned_loss=0.07192, over 21384.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3293, pruned_loss=0.09615, over 4296117.88 frames. 
], batch size: 194, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:04:48,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=398634.0, ans=0.125 2023-06-19 19:05:08,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=398694.0, ans=0.125 2023-06-19 19:06:03,151 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 2.356e+02 2.919e+02 3.477e+02 6.016e+02, threshold=5.839e+02, percent-clipped=0.0 2023-06-19 19:06:06,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=398814.0, ans=0.125 2023-06-19 19:06:49,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=398874.0, ans=0.2 2023-06-19 19:06:54,313 INFO [train.py:996] (0/4) Epoch 3, batch 5500, loss[loss=0.2438, simple_loss=0.345, pruned_loss=0.07132, over 20983.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3336, pruned_loss=0.09317, over 4294210.84 frames. ], batch size: 607, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:08:17,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=399114.0, ans=0.1 2023-06-19 19:08:41,590 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:09:05,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=399174.0, ans=0.125 2023-06-19 19:09:12,512 INFO [train.py:996] (0/4) Epoch 3, batch 5550, loss[loss=0.2402, simple_loss=0.3295, pruned_loss=0.07546, over 21642.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3295, pruned_loss=0.08892, over 4293675.34 frames. ], batch size: 414, lr: 1.17e-02, grad_scale: 16.0 2023-06-19 19:10:01,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=399294.0, ans=0.2 2023-06-19 19:10:24,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=399354.0, ans=0.5 2023-06-19 19:10:42,140 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 2.351e+02 2.803e+02 3.299e+02 6.466e+02, threshold=5.606e+02, percent-clipped=1.0 2023-06-19 19:11:52,834 INFO [train.py:996] (0/4) Epoch 3, batch 5600, loss[loss=0.3953, simple_loss=0.45, pruned_loss=0.1703, over 21442.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3302, pruned_loss=0.08882, over 4291822.22 frames. ], batch size: 507, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 19:13:47,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=399774.0, ans=0.125 2023-06-19 19:14:09,713 INFO [train.py:996] (0/4) Epoch 3, batch 5650, loss[loss=0.2426, simple_loss=0.3066, pruned_loss=0.08928, over 21535.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3341, pruned_loss=0.09068, over 4290639.01 frames. 
], batch size: 211, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 19:14:14,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=399834.0, ans=0.2 2023-06-19 19:15:28,360 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.620e+02 2.997e+02 3.705e+02 7.555e+02, threshold=5.994e+02, percent-clipped=4.0 2023-06-19 19:16:05,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400074.0, ans=0.1 2023-06-19 19:16:35,363 INFO [train.py:996] (0/4) Epoch 3, batch 5700, loss[loss=0.2296, simple_loss=0.3125, pruned_loss=0.07335, over 21596.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3335, pruned_loss=0.09255, over 4290556.95 frames. ], batch size: 230, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 19:17:27,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=400194.0, ans=0.125 2023-06-19 19:18:02,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=400314.0, ans=0.125 2023-06-19 19:18:21,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=400314.0, ans=0.125 2023-06-19 19:18:46,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=400374.0, ans=0.0 2023-06-19 19:18:47,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=400374.0, ans=0.1 2023-06-19 19:18:58,563 INFO [train.py:996] (0/4) Epoch 3, batch 5750, loss[loss=0.218, simple_loss=0.2956, pruned_loss=0.07021, over 21024.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3277, pruned_loss=0.08895, over 4278723.97 frames. ], batch size: 608, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:18:58,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=400434.0, ans=0.125 2023-06-19 19:19:38,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=400494.0, ans=0.125 2023-06-19 19:20:01,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=400554.0, ans=0.1 2023-06-19 19:20:11,183 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.423e+02 2.924e+02 3.462e+02 7.613e+02, threshold=5.849e+02, percent-clipped=6.0 2023-06-19 19:21:04,747 INFO [train.py:996] (0/4) Epoch 3, batch 5800, loss[loss=0.2232, simple_loss=0.3086, pruned_loss=0.0689, over 21283.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3251, pruned_loss=0.08656, over 4279127.68 frames. 
], batch size: 176, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:21:50,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=400794.0, ans=0.125 2023-06-19 19:22:02,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=400794.0, ans=0.125 2023-06-19 19:22:24,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=400854.0, ans=0.125 2023-06-19 19:22:36,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=400914.0, ans=0.125 2023-06-19 19:23:33,787 INFO [train.py:996] (0/4) Epoch 3, batch 5850, loss[loss=0.2428, simple_loss=0.3377, pruned_loss=0.07391, over 21470.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3227, pruned_loss=0.08197, over 4282945.09 frames. ], batch size: 471, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:24:22,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.56 vs. limit=10.0 2023-06-19 19:25:07,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.976e+02 2.345e+02 2.902e+02 5.016e+02, threshold=4.690e+02, percent-clipped=0.0 2023-06-19 19:25:16,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401214.0, ans=0.1 2023-06-19 19:25:42,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-19 19:25:43,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=401274.0, ans=0.125 2023-06-19 19:25:47,460 INFO [train.py:996] (0/4) Epoch 3, batch 5900, loss[loss=0.1645, simple_loss=0.2589, pruned_loss=0.03511, over 21713.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3143, pruned_loss=0.07557, over 4288223.26 frames. ], batch size: 298, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:25:48,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-19 19:27:11,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=401514.0, ans=0.125 2023-06-19 19:27:20,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=401514.0, ans=0.0 2023-06-19 19:27:22,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-19 19:28:00,317 INFO [train.py:996] (0/4) Epoch 3, batch 5950, loss[loss=0.277, simple_loss=0.313, pruned_loss=0.1205, over 20142.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3161, pruned_loss=0.08027, over 4285738.67 frames. ], batch size: 703, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:28:38,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=401694.0, ans=0.0 2023-06-19 19:28:50,489 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.38 vs. 
limit=10.0 2023-06-19 19:29:06,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 2.649e+02 3.303e+02 4.502e+02 8.568e+02, threshold=6.607e+02, percent-clipped=21.0 2023-06-19 19:30:02,812 INFO [train.py:996] (0/4) Epoch 3, batch 6000, loss[loss=0.2067, simple_loss=0.3175, pruned_loss=0.048, over 21241.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.313, pruned_loss=0.08359, over 4264453.07 frames. ], batch size: 548, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:30:02,813 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 19:30:52,516 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2725, simple_loss=0.3668, pruned_loss=0.0891, over 1796401.00 frames. 2023-06-19 19:30:52,516 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-19 19:30:53,088 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:31:15,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=401994.0, ans=0.125 2023-06-19 19:31:29,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401994.0, ans=0.1 2023-06-19 19:31:52,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=402114.0, ans=15.0 2023-06-19 19:32:09,610 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:32:11,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=402114.0, ans=0.125 2023-06-19 19:32:45,298 INFO [train.py:996] (0/4) Epoch 3, batch 6050, loss[loss=0.2396, simple_loss=0.2876, pruned_loss=0.0958, over 21240.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3094, pruned_loss=0.08536, over 4260787.86 frames. ], batch size: 608, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:32:49,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=22.5 2023-06-19 19:32:50,650 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:33:26,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-19 19:33:48,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=402354.0, ans=0.0 2023-06-19 19:34:07,189 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.314e+02 2.772e+02 3.160e+02 4.254e+02, threshold=5.544e+02, percent-clipped=0.0 2023-06-19 19:34:51,211 INFO [train.py:996] (0/4) Epoch 3, batch 6100, loss[loss=0.2351, simple_loss=0.3038, pruned_loss=0.08322, over 21765.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3083, pruned_loss=0.0844, over 4267808.62 frames. 
], batch size: 247, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:35:24,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=402594.0, ans=0.2 2023-06-19 19:35:40,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=402594.0, ans=0.0 2023-06-19 19:36:36,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402774.0, ans=0.1 2023-06-19 19:37:00,843 INFO [train.py:996] (0/4) Epoch 3, batch 6150, loss[loss=0.2858, simple_loss=0.3459, pruned_loss=0.1129, over 21645.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3125, pruned_loss=0.08844, over 4274309.00 frames. ], batch size: 414, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:37:55,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402954.0, ans=0.1 2023-06-19 19:38:11,187 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.614e+02 3.017e+02 3.568e+02 5.916e+02, threshold=6.034e+02, percent-clipped=1.0 2023-06-19 19:38:20,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=403014.0, ans=0.125 2023-06-19 19:38:52,132 INFO [train.py:996] (0/4) Epoch 3, batch 6200, loss[loss=0.2656, simple_loss=0.326, pruned_loss=0.1026, over 21313.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3146, pruned_loss=0.08755, over 4268531.90 frames. ], batch size: 159, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:39:30,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-19 19:40:11,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=403254.0, ans=0.125 2023-06-19 19:40:50,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=403314.0, ans=0.0 2023-06-19 19:40:54,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-19 19:41:08,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=403374.0, ans=0.125 2023-06-19 19:41:13,514 INFO [train.py:996] (0/4) Epoch 3, batch 6250, loss[loss=0.2193, simple_loss=0.3246, pruned_loss=0.057, over 21663.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3188, pruned_loss=0.08767, over 4263749.44 frames. 
], batch size: 263, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:41:28,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=403434.0, ans=10.0 2023-06-19 19:42:02,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=403494.0, ans=0.125 2023-06-19 19:42:32,757 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.931e+02 3.643e+02 4.697e+02 7.748e+02, threshold=7.286e+02, percent-clipped=9.0 2023-06-19 19:42:34,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=403614.0, ans=0.2 2023-06-19 19:43:26,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=403734.0, ans=0.025 2023-06-19 19:43:33,702 INFO [train.py:996] (0/4) Epoch 3, batch 6300, loss[loss=0.3119, simple_loss=0.3561, pruned_loss=0.1338, over 21335.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3242, pruned_loss=0.08781, over 4266867.62 frames. ], batch size: 144, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:44:10,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=403794.0, ans=0.0 2023-06-19 19:44:36,302 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:44:50,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=403914.0, ans=0.0 2023-06-19 19:45:25,059 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:45:37,159 INFO [train.py:996] (0/4) Epoch 3, batch 6350, loss[loss=0.2948, simple_loss=0.3546, pruned_loss=0.1175, over 21601.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3294, pruned_loss=0.09393, over 4268036.23 frames. ], batch size: 389, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:45:45,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=404034.0, ans=0.125 2023-06-19 19:46:47,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=404154.0, ans=0.0 2023-06-19 19:46:49,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=404214.0, ans=0.2 2023-06-19 19:46:50,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.707e+02 3.149e+02 3.828e+02 5.758e+02, threshold=6.298e+02, percent-clipped=0.0 2023-06-19 19:47:12,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=404214.0, ans=0.2 2023-06-19 19:47:44,508 INFO [train.py:996] (0/4) Epoch 3, batch 6400, loss[loss=0.2869, simple_loss=0.3613, pruned_loss=0.1063, over 21446.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3359, pruned_loss=0.09769, over 4268483.94 frames. ], batch size: 131, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:48:28,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=404454.0, ans=0.0 2023-06-19 19:49:39,741 INFO [train.py:996] (0/4) Epoch 3, batch 6450, loss[loss=0.3119, simple_loss=0.419, pruned_loss=0.1024, over 19750.00 frames. 
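Note: the Whitening records compare a per-module non-whiteness score against a scheduled limit (metric=6.39 vs. limit=10.0 just below, and so on), surfacing modules whose output covariance is far from isotropic. One standard score of this kind, equal to 1.0 when the covariance is a multiple of the identity and growing as a few directions dominate, is sketched below; it is illustrative and not necessarily the exact metric in scaling.py.

    import torch

    def whitening_metric(x):
        """Non-whiteness of features x with shape (num_frames, num_channels).

        Returns mean(eig^2) / mean(eig)^2 of the covariance: 1.0 for perfectly
        white features, larger as the eigenvalue spread grows.
        """
        x = x - x.mean(dim=0)
        cov = (x.t() @ x) / x.shape[0]                  # (C, C) covariance
        mean_eig_sq = (cov * cov).sum() / cov.shape[0]  # ||cov||_F^2 / C = mean eig^2
        mean_eig = torch.diagonal(cov).mean()           # trace / C  = mean eigenvalue
        return (mean_eig_sq / mean_eig.clamp(min=1e-20) ** 2).item()

    # White noise scores near 1.0, far below limits like 10.0 or 15.0:
    print(whitening_metric(torch.randn(10000, 256)))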
], tot_loss[loss=0.2675, simple_loss=0.3392, pruned_loss=0.09788, over 4266075.03 frames. ], batch size: 702, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:49:43,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=404634.0, ans=0.125 2023-06-19 19:50:27,223 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.39 vs. limit=10.0 2023-06-19 19:50:42,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.507e+02 2.854e+02 3.745e+02 6.123e+02, threshold=5.708e+02, percent-clipped=0.0 2023-06-19 19:50:49,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=404814.0, ans=0.0 2023-06-19 19:50:57,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-19 19:51:20,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=404874.0, ans=0.125 2023-06-19 19:51:36,618 INFO [train.py:996] (0/4) Epoch 3, batch 6500, loss[loss=0.2238, simple_loss=0.2876, pruned_loss=0.08006, over 21731.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3311, pruned_loss=0.09521, over 4260597.42 frames. ], batch size: 124, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:52:29,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=405054.0, ans=0.0 2023-06-19 19:52:32,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=405054.0, ans=0.125 2023-06-19 19:53:05,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=405174.0, ans=0.0 2023-06-19 19:53:18,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=405174.0, ans=0.125 2023-06-19 19:53:18,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=405174.0, ans=0.0 2023-06-19 19:53:27,024 INFO [train.py:996] (0/4) Epoch 3, batch 6550, loss[loss=0.2134, simple_loss=0.2811, pruned_loss=0.07282, over 21371.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3289, pruned_loss=0.09421, over 4260128.98 frames. ], batch size: 211, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:54:10,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=405354.0, ans=0.125 2023-06-19 19:54:37,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=405354.0, ans=0.07 2023-06-19 19:54:42,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.670e+02 3.153e+02 4.224e+02 7.538e+02, threshold=6.306e+02, percent-clipped=8.0 2023-06-19 19:55:02,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-19 19:55:22,194 INFO [train.py:996] (0/4) Epoch 3, batch 6600, loss[loss=0.228, simple_loss=0.2868, pruned_loss=0.08456, over 21122.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3223, pruned_loss=0.09365, over 4257604.10 frames. 
], batch size: 143, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:55:45,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=405594.0, ans=0.125 2023-06-19 19:57:08,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=405774.0, ans=0.0 2023-06-19 19:57:13,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=405774.0, ans=0.125 2023-06-19 19:57:20,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-19 19:57:21,372 INFO [train.py:996] (0/4) Epoch 3, batch 6650, loss[loss=0.2767, simple_loss=0.3359, pruned_loss=0.1087, over 21499.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3147, pruned_loss=0.0902, over 4264461.84 frames. ], batch size: 548, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:57:26,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=405834.0, ans=0.125 2023-06-19 19:57:41,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=405894.0, ans=0.125 2023-06-19 19:57:43,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=405894.0, ans=0.1 2023-06-19 19:58:17,417 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 2.325e+02 2.612e+02 3.036e+02 4.161e+02, threshold=5.224e+02, percent-clipped=0.0 2023-06-19 19:58:57,726 INFO [train.py:996] (0/4) Epoch 3, batch 6700, loss[loss=0.2098, simple_loss=0.273, pruned_loss=0.07329, over 21493.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3091, pruned_loss=0.08984, over 4249461.57 frames. ], batch size: 230, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 19:59:04,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=406134.0, ans=0.0 2023-06-19 20:00:08,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=406314.0, ans=0.1 2023-06-19 20:00:57,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=406434.0, ans=0.0 2023-06-19 20:00:58,757 INFO [train.py:996] (0/4) Epoch 3, batch 6750, loss[loss=0.2885, simple_loss=0.3362, pruned_loss=0.1204, over 21752.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3078, pruned_loss=0.09049, over 4242432.78 frames. 
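Note: in each "Epoch 3, batch N" record, loss[...] is the current batch and tot_loss[...] a running summary. The fractional frame counts (e.g. "over 4264461.84 frames" above) indicate the totals are decayed rather than plain sums, so the reported average is weighted toward recent batches. A sketch of that bookkeeping, with the decay factor as an assumption:

    def update_running(tot_loss, tot_frames, batch_loss_sum, batch_frames,
                       decay=0.999):
        """Exponentially-decayed running totals behind tot_loss reporting.

        The printed tot_loss is tot_loss / tot_frames; decaying numerator and
        denominator together is what produces the fractional frame counts seen
        in the log. decay=0.999 is an assumption, not the recipe's constant.
        """
        tot_loss = tot_loss * decay + batch_loss_sum
        tot_frames = tot_frames * decay + batch_frames
        return tot_loss, tot_frames, tot_loss / tot_frames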
], batch size: 441, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:01:08,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=406434.0, ans=0.1 2023-06-19 20:02:00,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.669e+02 3.031e+02 3.481e+02 8.147e+02, threshold=6.062e+02, percent-clipped=3.0 2023-06-19 20:02:05,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=406614.0, ans=0.2 2023-06-19 20:02:20,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=406674.0, ans=0.04949747468305833 2023-06-19 20:02:41,230 INFO [train.py:996] (0/4) Epoch 3, batch 6800, loss[loss=0.258, simple_loss=0.3015, pruned_loss=0.1072, over 21467.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3098, pruned_loss=0.09227, over 4239475.39 frames. ], batch size: 212, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:02:50,279 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-19 20:03:33,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=406854.0, ans=0.1 2023-06-19 20:04:21,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=406974.0, ans=0.0 2023-06-19 20:04:34,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=12.0 2023-06-19 20:04:39,336 INFO [train.py:996] (0/4) Epoch 3, batch 6850, loss[loss=0.265, simple_loss=0.3307, pruned_loss=0.0997, over 21873.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.309, pruned_loss=0.09363, over 4250147.50 frames. ], batch size: 107, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:05:03,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=407094.0, ans=0.1 2023-06-19 20:05:53,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.761e+02 3.161e+02 3.726e+02 8.116e+02, threshold=6.323e+02, percent-clipped=2.0 2023-06-19 20:06:03,447 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.87 vs. limit=22.5 2023-06-19 20:06:47,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=407334.0, ans=0.0 2023-06-19 20:06:47,900 INFO [train.py:996] (0/4) Epoch 3, batch 6900, loss[loss=0.2851, simple_loss=0.397, pruned_loss=0.08658, over 19734.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3119, pruned_loss=0.09402, over 4260647.05 frames. 
], batch size: 702, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 20:06:54,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=407334.0, ans=0.2 2023-06-19 20:07:05,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407334.0, ans=0.1 2023-06-19 20:07:34,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=407394.0, ans=0.025 2023-06-19 20:08:06,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407514.0, ans=0.1 2023-06-19 20:08:08,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=407514.0, ans=0.0 2023-06-19 20:08:39,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=407574.0, ans=0.125 2023-06-19 20:08:53,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=407574.0, ans=0.09899494936611666 2023-06-19 20:08:57,085 INFO [train.py:996] (0/4) Epoch 3, batch 6950, loss[loss=0.2526, simple_loss=0.3202, pruned_loss=0.09244, over 21903.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3118, pruned_loss=0.08996, over 4265790.56 frames. ], batch size: 316, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:09:01,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=407634.0, ans=0.04949747468305833 2023-06-19 20:10:12,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 2.540e+02 2.981e+02 3.670e+02 6.199e+02, threshold=5.963e+02, percent-clipped=0.0 2023-06-19 20:10:21,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=407814.0, ans=0.025 2023-06-19 20:10:22,278 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=15.0 2023-06-19 20:10:31,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=407874.0, ans=0.125 2023-06-19 20:10:48,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=407874.0, ans=0.0 2023-06-19 20:10:59,037 INFO [train.py:996] (0/4) Epoch 3, batch 7000, loss[loss=0.2476, simple_loss=0.3106, pruned_loss=0.0923, over 21809.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3155, pruned_loss=0.0926, over 4267121.79 frames. ], batch size: 118, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:11:28,333 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-68000.pt 2023-06-19 20:11:34,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=12.0 2023-06-19 20:12:36,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=408174.0, ans=0.125 2023-06-19 20:12:47,651 INFO [train.py:996] (0/4) Epoch 3, batch 7050, loss[loss=0.2358, simple_loss=0.3035, pruned_loss=0.08405, over 15427.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3151, pruned_loss=0.09226, over 4258718.43 frames. 
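Note: the checkpoint saved above is named checkpoint-68000.pt while the surrounding records are at epoch 3, batch 7000, i.e. the file is keyed by the cumulative batch index across epochs and written at fixed global-batch intervals. A sketch of that pattern; the interval (4000, which divides 68000) and the saved fields are assumptions.

    from pathlib import Path
    import torch

    def maybe_save(model, optimizer, batch_idx_train, exp_dir, every_n=4000):
        """Write 'checkpoint-<global_batch>.pt' every `every_n` training batches."""
        if batch_idx_train == 0 or batch_idx_train % every_n != 0:
            return None
        path = Path(exp_dir) / f"checkpoint-{batch_idx_train}.pt"
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "batch_idx_train": batch_idx_train},
            path,
        )
        return path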
], batch size: 60, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:13:32,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=408294.0, ans=0.2 2023-06-19 20:13:37,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.51 vs. limit=15.0 2023-06-19 20:14:15,673 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.619e+02 2.979e+02 3.619e+02 9.670e+02, threshold=5.957e+02, percent-clipped=3.0 2023-06-19 20:14:52,043 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:14:59,111 INFO [train.py:996] (0/4) Epoch 3, batch 7100, loss[loss=0.252, simple_loss=0.3359, pruned_loss=0.08411, over 21661.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3185, pruned_loss=0.0935, over 4243720.24 frames. ], batch size: 441, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:15:33,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=408594.0, ans=0.125 2023-06-19 20:15:37,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-19 20:15:40,287 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-19 20:15:41,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=408594.0, ans=0.125 2023-06-19 20:16:02,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=408654.0, ans=0.125 2023-06-19 20:16:45,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=408714.0, ans=0.125 2023-06-19 20:17:01,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=408774.0, ans=0.0 2023-06-19 20:17:16,297 INFO [train.py:996] (0/4) Epoch 3, batch 7150, loss[loss=0.2571, simple_loss=0.324, pruned_loss=0.09514, over 21901.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3161, pruned_loss=0.09128, over 4245338.78 frames. ], batch size: 372, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:17:24,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=408834.0, ans=0.025 2023-06-19 20:17:27,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=408834.0, ans=0.95 2023-06-19 20:17:52,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.05 vs. limit=22.5 2023-06-19 20:18:32,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.361e+02 2.956e+02 3.378e+02 5.911e+02, threshold=5.912e+02, percent-clipped=0.0 2023-06-19 20:19:00,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=409074.0, ans=0.125 2023-06-19 20:19:23,242 INFO [train.py:996] (0/4) Epoch 3, batch 7200, loss[loss=0.2518, simple_loss=0.31, pruned_loss=0.09681, over 21862.00 frames. 
], tot_loss[loss=0.2532, simple_loss=0.3192, pruned_loss=0.09361, over 4244511.80 frames. ], batch size: 373, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:19:28,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=409134.0, ans=0.125 2023-06-19 20:20:44,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=409314.0, ans=0.0 2023-06-19 20:20:47,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=409314.0, ans=0.0 2023-06-19 20:21:16,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=409374.0, ans=10.0 2023-06-19 20:21:23,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=409374.0, ans=0.04949747468305833 2023-06-19 20:21:31,195 INFO [train.py:996] (0/4) Epoch 3, batch 7250, loss[loss=0.2834, simple_loss=0.3111, pruned_loss=0.1279, over 21391.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3157, pruned_loss=0.09368, over 4253832.64 frames. ], batch size: 509, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:22:43,331 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.618e+02 2.951e+02 3.716e+02 8.405e+02, threshold=5.903e+02, percent-clipped=1.0 2023-06-19 20:23:23,304 INFO [train.py:996] (0/4) Epoch 3, batch 7300, loss[loss=0.1933, simple_loss=0.258, pruned_loss=0.06431, over 21548.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3102, pruned_loss=0.09245, over 4255531.88 frames. ], batch size: 263, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:23:24,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-19 20:24:49,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-19 20:25:18,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=409974.0, ans=0.0 2023-06-19 20:25:26,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-19 20:25:26,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-19 20:25:27,400 INFO [train.py:996] (0/4) Epoch 3, batch 7350, loss[loss=0.2686, simple_loss=0.3247, pruned_loss=0.1063, over 21659.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3075, pruned_loss=0.09324, over 4256093.11 frames. ], batch size: 332, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:25:39,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=410034.0, ans=0.2 2023-06-19 20:25:51,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=410034.0, ans=0.0 2023-06-19 20:26:06,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.51 vs. 
limit=22.5 2023-06-19 20:26:10,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=410154.0, ans=0.2 2023-06-19 20:26:46,549 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.697e+02 3.166e+02 3.579e+02 5.616e+02, threshold=6.332e+02, percent-clipped=0.0 2023-06-19 20:27:08,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.36 vs. limit=15.0 2023-06-19 20:27:34,129 INFO [train.py:996] (0/4) Epoch 3, batch 7400, loss[loss=0.3198, simple_loss=0.3903, pruned_loss=0.1246, over 21555.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3149, pruned_loss=0.0957, over 4258469.22 frames. ], batch size: 473, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:28:00,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=410394.0, ans=0.125 2023-06-19 20:28:01,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=22.5 2023-06-19 20:29:03,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-19 20:29:32,634 INFO [train.py:996] (0/4) Epoch 3, batch 7450, loss[loss=0.2369, simple_loss=0.2896, pruned_loss=0.0921, over 21416.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3133, pruned_loss=0.09379, over 4265861.96 frames. ], batch size: 194, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:29:34,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=410634.0, ans=0.125 2023-06-19 20:30:09,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=410694.0, ans=0.05 2023-06-19 20:30:26,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=410754.0, ans=0.125 2023-06-19 20:30:55,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.751e+02 3.404e+02 4.260e+02 8.554e+02, threshold=6.809e+02, percent-clipped=4.0 2023-06-19 20:31:51,008 INFO [train.py:996] (0/4) Epoch 3, batch 7500, loss[loss=0.2822, simple_loss=0.3549, pruned_loss=0.1048, over 21857.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3181, pruned_loss=0.09608, over 4269025.06 frames. ], batch size: 118, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:32:04,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=410934.0, ans=10.0 2023-06-19 20:32:31,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=411054.0, ans=0.125 2023-06-19 20:32:56,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=411114.0, ans=0.04949747468305833 2023-06-19 20:33:46,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=411174.0, ans=0.1 2023-06-19 20:33:53,283 INFO [train.py:996] (0/4) Epoch 3, batch 7550, loss[loss=0.2192, simple_loss=0.2969, pruned_loss=0.07073, over 21218.00 frames. 
], tot_loss[loss=0.2589, simple_loss=0.3267, pruned_loss=0.09553, over 4270826.04 frames. ], batch size: 143, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:34:47,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=411354.0, ans=0.125 2023-06-19 20:34:54,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.594e+02 3.226e+02 4.168e+02 6.750e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-19 20:35:42,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=411474.0, ans=0.125 2023-06-19 20:35:53,977 INFO [train.py:996] (0/4) Epoch 3, batch 7600, loss[loss=0.278, simple_loss=0.3345, pruned_loss=0.1108, over 21899.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3241, pruned_loss=0.09314, over 4270645.46 frames. ], batch size: 351, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:35:59,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2023-06-19 20:36:04,168 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=22.5 2023-06-19 20:36:08,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=411594.0, ans=0.125 2023-06-19 20:36:11,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=411594.0, ans=0.125 2023-06-19 20:37:04,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=411714.0, ans=0.025 2023-06-19 20:37:39,155 INFO [train.py:996] (0/4) Epoch 3, batch 7650, loss[loss=0.2504, simple_loss=0.3096, pruned_loss=0.09566, over 21824.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3235, pruned_loss=0.09426, over 4277209.37 frames. ], batch size: 282, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:37:44,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=411834.0, ans=0.0 2023-06-19 20:38:21,193 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-19 20:38:50,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=411954.0, ans=0.125 2023-06-19 20:38:54,169 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.672e+02 3.022e+02 3.544e+02 5.089e+02, threshold=6.045e+02, percent-clipped=0.0 2023-06-19 20:39:32,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=412074.0, ans=0.125 2023-06-19 20:39:40,539 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=22.5 2023-06-19 20:39:48,295 INFO [train.py:996] (0/4) Epoch 3, batch 7700, loss[loss=0.2889, simple_loss=0.3516, pruned_loss=0.1131, over 21467.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3272, pruned_loss=0.09768, over 4282251.59 frames. 
], batch size: 194, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:40:50,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=412254.0, ans=0.0 2023-06-19 20:41:07,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=412314.0, ans=0.2 2023-06-19 20:41:31,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=412374.0, ans=0.04949747468305833 2023-06-19 20:41:41,990 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-06-19 20:42:14,287 INFO [train.py:996] (0/4) Epoch 3, batch 7750, loss[loss=0.2285, simple_loss=0.2734, pruned_loss=0.09176, over 20795.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3319, pruned_loss=0.09781, over 4284927.91 frames. ], batch size: 609, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:42:50,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=412494.0, ans=0.125 2023-06-19 20:43:04,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=412554.0, ans=0.125 2023-06-19 20:43:07,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=412554.0, ans=0.125 2023-06-19 20:43:23,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=412614.0, ans=0.05 2023-06-19 20:43:25,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 3.175e+02 3.701e+02 4.532e+02 8.848e+02, threshold=7.402e+02, percent-clipped=6.0 2023-06-19 20:43:54,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=412614.0, ans=0.0 2023-06-19 20:44:17,434 INFO [train.py:996] (0/4) Epoch 3, batch 7800, loss[loss=0.2071, simple_loss=0.2522, pruned_loss=0.08095, over 21277.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3343, pruned_loss=0.09905, over 4279634.03 frames. ], batch size: 143, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:44:52,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=412854.0, ans=0.125 2023-06-19 20:45:24,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=412914.0, ans=0.125 2023-06-19 20:45:44,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=412974.0, ans=0.125 2023-06-19 20:46:01,771 INFO [train.py:996] (0/4) Epoch 3, batch 7850, loss[loss=0.2617, simple_loss=0.3076, pruned_loss=0.108, over 21584.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3301, pruned_loss=0.09863, over 4285813.46 frames. 
], batch size: 415, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:46:03,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=413034.0, ans=0.0 2023-06-19 20:46:12,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=413034.0, ans=0.125 2023-06-19 20:46:23,223 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-19 20:46:39,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=413094.0, ans=0.125 2023-06-19 20:47:05,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.925e+02 3.666e+02 4.492e+02 8.258e+02, threshold=7.332e+02, percent-clipped=3.0 2023-06-19 20:47:13,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=413214.0, ans=0.125 2023-06-19 20:47:42,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-19 20:48:04,380 INFO [train.py:996] (0/4) Epoch 3, batch 7900, loss[loss=0.3841, simple_loss=0.4512, pruned_loss=0.1585, over 21424.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.324, pruned_loss=0.09677, over 4275602.36 frames. ], batch size: 507, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:48:35,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=413394.0, ans=0.1 2023-06-19 20:49:05,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=413454.0, ans=0.1 2023-06-19 20:49:45,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=413574.0, ans=0.125 2023-06-19 20:50:12,452 INFO [train.py:996] (0/4) Epoch 3, batch 7950, loss[loss=0.2651, simple_loss=0.3625, pruned_loss=0.08382, over 21734.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.329, pruned_loss=0.09615, over 4269069.56 frames. ], batch size: 351, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 20:50:29,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=413634.0, ans=0.125 2023-06-19 20:50:50,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=413694.0, ans=0.125 2023-06-19 20:50:57,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=413694.0, ans=0.125 2023-06-19 20:51:02,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-19 20:51:18,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.993e+02 3.629e+02 4.291e+02 8.050e+02, threshold=7.259e+02, percent-clipped=3.0 2023-06-19 20:52:13,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=413874.0, ans=0.025 2023-06-19 20:52:18,829 INFO [train.py:996] (0/4) Epoch 3, batch 8000, loss[loss=0.2818, simple_loss=0.3645, pruned_loss=0.0995, over 21609.00 frames. 
], tot_loss[loss=0.2692, simple_loss=0.3361, pruned_loss=0.1011, over 4269289.36 frames. ], batch size: 389, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:52:26,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=413934.0, ans=0.125 2023-06-19 20:53:31,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=414114.0, ans=0.04949747468305833 2023-06-19 20:54:04,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=414174.0, ans=0.0 2023-06-19 20:54:25,796 INFO [train.py:996] (0/4) Epoch 3, batch 8050, loss[loss=0.3118, simple_loss=0.4019, pruned_loss=0.1108, over 21263.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3398, pruned_loss=0.1007, over 4262469.87 frames. ], batch size: 549, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:55:12,985 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:55:31,939 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:55:33,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=414354.0, ans=0.0 2023-06-19 20:55:38,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=414414.0, ans=0.125 2023-06-19 20:55:40,683 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.904e+02 3.441e+02 4.463e+02 1.081e+03, threshold=6.883e+02, percent-clipped=4.0 2023-06-19 20:55:45,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=414414.0, ans=0.125 2023-06-19 20:56:18,523 INFO [train.py:996] (0/4) Epoch 3, batch 8100, loss[loss=0.2532, simple_loss=0.3133, pruned_loss=0.09652, over 21516.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3372, pruned_loss=0.1015, over 4273528.44 frames. ], batch size: 177, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 20:58:00,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=414714.0, ans=0.5 2023-06-19 20:58:31,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=414774.0, ans=10.0 2023-06-19 20:58:51,123 INFO [train.py:996] (0/4) Epoch 3, batch 8150, loss[loss=0.2393, simple_loss=0.3287, pruned_loss=0.07497, over 21645.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3439, pruned_loss=0.102, over 4272529.79 frames. ], batch size: 247, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 20:59:05,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=414834.0, ans=0.0 2023-06-19 20:59:10,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=414894.0, ans=0.2 2023-06-19 20:59:33,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=33.06 vs. 
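Note: the learning rate drifts down slowly through these records (1.16e-02 at batch 5850, 1.15e-02 by batch 6950, 1.14e-02 by batch 8150). That shape matches a smooth power-law decay in both the step and epoch counters, such as icefall's Eden rule; whether this run uses Eden is not shown in these lines, so treat the sketch below as illustrative, with all constants left as assumptions rather than this run's settings.

    def eden_lr(base_lr, step, epoch, lr_batches, lr_epochs):
        """Eden-style schedule: slow power-law decay in step and in epoch."""
        batch_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor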
limit=22.5 2023-06-19 20:59:34,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=414954.0, ans=15.0 2023-06-19 20:59:43,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=414954.0, ans=0.125 2023-06-19 21:00:01,454 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.727e+02 3.206e+02 3.958e+02 8.712e+02, threshold=6.412e+02, percent-clipped=8.0 2023-06-19 21:00:28,965 INFO [train.py:996] (0/4) Epoch 3, batch 8200, loss[loss=0.2524, simple_loss=0.3082, pruned_loss=0.09832, over 21802.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3348, pruned_loss=0.09948, over 4272243.85 frames. ], batch size: 352, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:00:29,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=415134.0, ans=0.125 2023-06-19 21:00:42,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=415134.0, ans=0.125 2023-06-19 21:01:28,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415254.0, ans=0.1 2023-06-19 21:01:56,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=415314.0, ans=0.125 2023-06-19 21:01:58,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-19 21:02:03,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=415374.0, ans=10.0 2023-06-19 21:02:26,219 INFO [train.py:996] (0/4) Epoch 3, batch 8250, loss[loss=0.224, simple_loss=0.2917, pruned_loss=0.07815, over 21840.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3303, pruned_loss=0.09836, over 4269851.40 frames. ], batch size: 102, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:03:21,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=415494.0, ans=0.0 2023-06-19 21:03:36,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=415554.0, ans=0.125 2023-06-19 21:03:39,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415554.0, ans=0.1 2023-06-19 21:03:43,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=415614.0, ans=0.0 2023-06-19 21:03:47,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.708e+02 3.103e+02 3.539e+02 5.585e+02, threshold=6.206e+02, percent-clipped=0.0 2023-06-19 21:04:29,219 INFO [train.py:996] (0/4) Epoch 3, batch 8300, loss[loss=0.3348, simple_loss=0.3931, pruned_loss=0.1382, over 21477.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3287, pruned_loss=0.0952, over 4271012.16 frames. 
], batch size: 508, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:04:48,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=415734.0, ans=0.125 2023-06-19 21:05:11,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=415794.0, ans=0.0 2023-06-19 21:05:26,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=415854.0, ans=0.0 2023-06-19 21:05:48,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-06-19 21:05:51,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.25 vs. limit=22.5 2023-06-19 21:06:12,433 INFO [train.py:996] (0/4) Epoch 3, batch 8350, loss[loss=0.2376, simple_loss=0.31, pruned_loss=0.08254, over 21347.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3272, pruned_loss=0.09218, over 4266346.59 frames. ], batch size: 160, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:06:25,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=416034.0, ans=0.125 2023-06-19 21:06:39,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=416094.0, ans=0.0 2023-06-19 21:07:24,673 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.460e+02 2.877e+02 3.560e+02 6.454e+02, threshold=5.755e+02, percent-clipped=1.0 2023-06-19 21:07:25,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=416214.0, ans=0.2 2023-06-19 21:08:19,288 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-19 21:08:24,291 INFO [train.py:996] (0/4) Epoch 3, batch 8400, loss[loss=0.2329, simple_loss=0.3208, pruned_loss=0.07252, over 21739.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3258, pruned_loss=0.09062, over 4261136.14 frames. ], batch size: 351, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:08:37,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=416334.0, ans=0.1 2023-06-19 21:08:39,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=416334.0, ans=0.0 2023-06-19 21:08:43,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=416394.0, ans=0.0 2023-06-19 21:08:46,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=416394.0, ans=0.0 2023-06-19 21:09:24,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416514.0, ans=0.1 2023-06-19 21:09:31,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=15.0 2023-06-19 21:09:41,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=416574.0, ans=0.0 2023-06-19 21:09:50,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=416574.0, ans=0.125 2023-06-19 21:09:54,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=416634.0, ans=0.125 2023-06-19 21:09:55,431 INFO [train.py:996] (0/4) Epoch 3, batch 8450, loss[loss=0.2881, simple_loss=0.3553, pruned_loss=0.1104, over 16917.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3227, pruned_loss=0.09002, over 4264166.42 frames. ], batch size: 60, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:09:55,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=416634.0, ans=0.2 2023-06-19 21:09:55,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=416634.0, ans=0.0 2023-06-19 21:10:54,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=416754.0, ans=0.0 2023-06-19 21:11:06,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.528e+02 3.268e+02 4.022e+02 7.297e+02, threshold=6.535e+02, percent-clipped=4.0 2023-06-19 21:11:33,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=416874.0, ans=0.2 2023-06-19 21:11:53,074 INFO [train.py:996] (0/4) Epoch 3, batch 8500, loss[loss=0.324, simple_loss=0.4326, pruned_loss=0.1077, over 20753.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3207, pruned_loss=0.09217, over 4251525.95 frames. ], batch size: 607, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:12:07,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-19 21:12:15,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=416994.0, ans=0.07 2023-06-19 21:12:31,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=416994.0, ans=0.125 2023-06-19 21:12:39,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-19 21:13:08,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=417054.0, ans=0.2 2023-06-19 21:13:58,921 INFO [train.py:996] (0/4) Epoch 3, batch 8550, loss[loss=0.244, simple_loss=0.3181, pruned_loss=0.08489, over 21190.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3244, pruned_loss=0.09467, over 4257487.20 frames. 
], batch size: 176, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:14:05,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=417234.0, ans=0.0 2023-06-19 21:14:06,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=417234.0, ans=0.0 2023-06-19 21:14:23,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=417294.0, ans=0.0 2023-06-19 21:14:39,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=417294.0, ans=0.2 2023-06-19 21:15:14,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.754e+02 3.232e+02 3.747e+02 6.984e+02, threshold=6.464e+02, percent-clipped=1.0 2023-06-19 21:15:38,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=417414.0, ans=0.125 2023-06-19 21:15:38,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=417414.0, ans=0.0 2023-06-19 21:15:55,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=417474.0, ans=0.125 2023-06-19 21:16:05,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-19 21:16:08,810 INFO [train.py:996] (0/4) Epoch 3, batch 8600, loss[loss=0.3783, simple_loss=0.4295, pruned_loss=0.1635, over 21491.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3333, pruned_loss=0.09779, over 4258466.06 frames. ], batch size: 508, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:16:09,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=417534.0, ans=0.02 2023-06-19 21:16:38,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=417594.0, ans=0.125 2023-06-19 21:16:48,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=417594.0, ans=0.125 2023-06-19 21:17:16,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=417714.0, ans=15.0 2023-06-19 21:17:33,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=417714.0, ans=0.0 2023-06-19 21:18:02,541 INFO [train.py:996] (0/4) Epoch 3, batch 8650, loss[loss=0.1842, simple_loss=0.2814, pruned_loss=0.04356, over 21758.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.339, pruned_loss=0.09911, over 4266780.69 frames. ], batch size: 282, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:18:30,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-19 21:19:20,786 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 2.624e+02 3.051e+02 3.895e+02 8.480e+02, threshold=6.103e+02, percent-clipped=4.0 2023-06-19 21:19:52,987 INFO [train.py:996] (0/4) Epoch 3, batch 8700, loss[loss=0.2209, simple_loss=0.297, pruned_loss=0.07243, over 21227.00 frames. 
], tot_loss[loss=0.2597, simple_loss=0.3293, pruned_loss=0.09507, over 4271389.57 frames. ], batch size: 548, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:20:13,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=418134.0, ans=0.1 2023-06-19 21:20:21,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=418194.0, ans=0.0 2023-06-19 21:20:31,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=418194.0, ans=0.2 2023-06-19 21:20:34,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.33 vs. limit=10.0 2023-06-19 21:21:02,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=418314.0, ans=0.125 2023-06-19 21:21:14,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=418314.0, ans=0.125 2023-06-19 21:21:16,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-19 21:22:05,625 INFO [train.py:996] (0/4) Epoch 3, batch 8750, loss[loss=0.2309, simple_loss=0.2993, pruned_loss=0.08131, over 21983.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3248, pruned_loss=0.09568, over 4275718.32 frames. ], batch size: 103, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:22:40,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=418554.0, ans=0.015 2023-06-19 21:22:59,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.653e+02 3.166e+02 3.937e+02 6.791e+02, threshold=6.332e+02, percent-clipped=3.0 2023-06-19 21:22:59,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=418614.0, ans=0.04949747468305833 2023-06-19 21:23:05,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=418614.0, ans=0.125 2023-06-19 21:23:43,287 INFO [train.py:996] (0/4) Epoch 3, batch 8800, loss[loss=0.2776, simple_loss=0.3567, pruned_loss=0.09925, over 21692.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3324, pruned_loss=0.09773, over 4279263.09 frames. ], batch size: 298, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:23:48,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=418734.0, ans=0.125 2023-06-19 21:23:48,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.70 vs. limit=15.0 2023-06-19 21:23:51,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=418734.0, ans=0.04949747468305833 2023-06-19 21:23:53,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.04 vs. 
limit=15.0 2023-06-19 21:24:13,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=418854.0, ans=0.0 2023-06-19 21:25:07,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=418974.0, ans=0.0 2023-06-19 21:25:23,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=418974.0, ans=0.125 2023-06-19 21:25:29,645 INFO [train.py:996] (0/4) Epoch 3, batch 8850, loss[loss=0.2657, simple_loss=0.3348, pruned_loss=0.09833, over 21596.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3405, pruned_loss=0.1001, over 4268723.58 frames. ], batch size: 414, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:25:34,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=419034.0, ans=0.125 2023-06-19 21:26:07,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=419154.0, ans=0.1 2023-06-19 21:26:21,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=419154.0, ans=0.0 2023-06-19 21:26:43,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.235e+02 2.885e+02 3.451e+02 3.948e+02 6.880e+02, threshold=6.902e+02, percent-clipped=2.0 2023-06-19 21:26:57,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=419214.0, ans=0.95 2023-06-19 21:27:20,929 INFO [train.py:996] (0/4) Epoch 3, batch 8900, loss[loss=0.2501, simple_loss=0.311, pruned_loss=0.09464, over 21781.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3335, pruned_loss=0.09821, over 4267934.53 frames. ], batch size: 371, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:27:24,607 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:27:27,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=419334.0, ans=0.0 2023-06-19 21:27:30,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=419334.0, ans=0.125 2023-06-19 21:28:48,646 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.98 vs. limit=15.0 2023-06-19 21:29:26,019 INFO [train.py:996] (0/4) Epoch 3, batch 8950, loss[loss=0.2845, simple_loss=0.3804, pruned_loss=0.09428, over 21256.00 frames. ], tot_loss[loss=0.265, simple_loss=0.335, pruned_loss=0.0975, over 4268439.30 frames. 
], batch size: 549, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:29:30,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=419634.0, ans=0.2 2023-06-19 21:29:51,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=419694.0, ans=0.0 2023-06-19 21:30:32,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=419754.0, ans=0.125 2023-06-19 21:30:46,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.693e+02 3.248e+02 3.972e+02 8.193e+02, threshold=6.496e+02, percent-clipped=3.0 2023-06-19 21:30:55,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=419814.0, ans=0.125 2023-06-19 21:31:13,410 INFO [train.py:996] (0/4) Epoch 3, batch 9000, loss[loss=0.2987, simple_loss=0.3753, pruned_loss=0.111, over 20722.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3275, pruned_loss=0.09675, over 4265497.97 frames. ], batch size: 608, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:31:13,412 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 21:31:59,486 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2739, simple_loss=0.372, pruned_loss=0.08794, over 1796401.00 frames. 2023-06-19 21:31:59,488 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-19 21:32:07,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=419934.0, ans=0.125 2023-06-19 21:32:07,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419934.0, ans=0.1 2023-06-19 21:33:01,536 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.98 vs. limit=22.5 2023-06-19 21:33:49,582 INFO [train.py:996] (0/4) Epoch 3, batch 9050, loss[loss=0.2212, simple_loss=0.3019, pruned_loss=0.07026, over 21700.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3217, pruned_loss=0.09292, over 4261670.17 frames. ], batch size: 298, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:34:25,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=420234.0, ans=0.125 2023-06-19 21:34:56,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=420354.0, ans=0.125 2023-06-19 21:35:09,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.569e+02 2.874e+02 3.466e+02 6.244e+02, threshold=5.748e+02, percent-clipped=0.0 2023-06-19 21:35:09,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=420414.0, ans=0.125 2023-06-19 21:35:55,514 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5 2023-06-19 21:36:03,323 INFO [train.py:996] (0/4) Epoch 3, batch 9100, loss[loss=0.2433, simple_loss=0.3322, pruned_loss=0.0772, over 21713.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3282, pruned_loss=0.09529, over 4259055.47 frames. 
], batch size: 247, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:36:25,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=420534.0, ans=0.2 2023-06-19 21:36:37,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=420594.0, ans=0.2 2023-06-19 21:37:23,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=420714.0, ans=0.0 2023-06-19 21:37:37,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.99 vs. limit=10.0 2023-06-19 21:37:38,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=420714.0, ans=0.015 2023-06-19 21:38:17,220 INFO [train.py:996] (0/4) Epoch 3, batch 9150, loss[loss=0.2155, simple_loss=0.298, pruned_loss=0.06654, over 21736.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3323, pruned_loss=0.09415, over 4257680.37 frames. ], batch size: 112, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:39:38,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=22.5 2023-06-19 21:39:44,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-19 21:39:46,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.484e+02 2.819e+02 3.300e+02 4.761e+02, threshold=5.639e+02, percent-clipped=0.0 2023-06-19 21:40:17,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=421074.0, ans=0.125 2023-06-19 21:40:30,503 INFO [train.py:996] (0/4) Epoch 3, batch 9200, loss[loss=0.2035, simple_loss=0.2856, pruned_loss=0.06074, over 21413.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3334, pruned_loss=0.09208, over 4269426.20 frames. ], batch size: 211, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:40:56,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=421194.0, ans=0.2 2023-06-19 21:41:36,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=421254.0, ans=0.2 2023-06-19 21:42:36,623 INFO [train.py:996] (0/4) Epoch 3, batch 9250, loss[loss=0.2815, simple_loss=0.3333, pruned_loss=0.1148, over 21497.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3368, pruned_loss=0.09545, over 4266969.13 frames. 
], batch size: 194, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:42:52,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=421494.0, ans=0.0 2023-06-19 21:43:20,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=421554.0, ans=0.125 2023-06-19 21:43:30,542 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.625e+02 3.038e+02 3.484e+02 5.422e+02, threshold=6.077e+02, percent-clipped=0.0 2023-06-19 21:43:32,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=421614.0, ans=0.5 2023-06-19 21:44:02,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-19 21:44:17,070 INFO [train.py:996] (0/4) Epoch 3, batch 9300, loss[loss=0.2416, simple_loss=0.2901, pruned_loss=0.09657, over 20759.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3308, pruned_loss=0.09518, over 4270742.45 frames. ], batch size: 608, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 21:44:23,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=421734.0, ans=0.0 2023-06-19 21:44:44,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=421794.0, ans=0.125 2023-06-19 21:44:50,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=421794.0, ans=0.125 2023-06-19 21:44:56,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=12.0 2023-06-19 21:45:27,500 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:45:31,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-06-19 21:46:29,841 INFO [train.py:996] (0/4) Epoch 3, batch 9350, loss[loss=0.2862, simple_loss=0.3335, pruned_loss=0.1195, over 20068.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.338, pruned_loss=0.09671, over 4271232.41 frames. ], batch size: 702, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:46:34,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=422034.0, ans=0.125 2023-06-19 21:47:21,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=422154.0, ans=0.125 2023-06-19 21:47:50,601 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.663e+02 3.172e+02 3.860e+02 6.008e+02, threshold=6.345e+02, percent-clipped=0.0 2023-06-19 21:47:52,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=422214.0, ans=0.0 2023-06-19 21:48:00,104 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=15.0 2023-06-19 21:48:14,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. 
limit=22.5 2023-06-19 21:48:21,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=422334.0, ans=0.2 2023-06-19 21:48:22,862 INFO [train.py:996] (0/4) Epoch 3, batch 9400, loss[loss=0.2197, simple_loss=0.2853, pruned_loss=0.077, over 21879.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3395, pruned_loss=0.09741, over 4265521.68 frames. ], batch size: 107, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:48:23,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=422334.0, ans=0.125 2023-06-19 21:48:26,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=422334.0, ans=0.125 2023-06-19 21:48:39,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=422334.0, ans=0.125 2023-06-19 21:49:17,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=422454.0, ans=0.1 2023-06-19 21:49:50,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=422514.0, ans=0.0 2023-06-19 21:50:01,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=422574.0, ans=0.0 2023-06-19 21:50:11,535 INFO [train.py:996] (0/4) Epoch 3, batch 9450, loss[loss=0.2282, simple_loss=0.3155, pruned_loss=0.07041, over 20795.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3307, pruned_loss=0.09645, over 4266864.40 frames. ], batch size: 609, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:50:37,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-19 21:51:20,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=422814.0, ans=0.0 2023-06-19 21:51:28,766 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.824e+02 3.255e+02 3.921e+02 7.411e+02, threshold=6.510e+02, percent-clipped=5.0 2023-06-19 21:51:50,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=422874.0, ans=0.125 2023-06-19 21:52:04,913 INFO [train.py:996] (0/4) Epoch 3, batch 9500, loss[loss=0.2197, simple_loss=0.2863, pruned_loss=0.07651, over 21349.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3225, pruned_loss=0.09482, over 4263676.20 frames. ], batch size: 131, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:52:33,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=422994.0, ans=0.0 2023-06-19 21:52:41,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=422994.0, ans=0.125 2023-06-19 21:52:58,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.16 vs. 
limit=15.0 2023-06-19 21:53:17,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=423114.0, ans=0.125 2023-06-19 21:53:43,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=423174.0, ans=0.125 2023-06-19 21:53:51,867 INFO [train.py:996] (0/4) Epoch 3, batch 9550, loss[loss=0.2733, simple_loss=0.3357, pruned_loss=0.1054, over 21685.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3275, pruned_loss=0.09761, over 4271647.27 frames. ], batch size: 351, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:55:14,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 2.688e+02 3.199e+02 3.613e+02 5.972e+02, threshold=6.398e+02, percent-clipped=0.0 2023-06-19 21:55:29,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=423474.0, ans=0.04949747468305833 2023-06-19 21:55:34,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=423474.0, ans=0.125 2023-06-19 21:55:39,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=423474.0, ans=0.125 2023-06-19 21:55:41,587 INFO [train.py:996] (0/4) Epoch 3, batch 9600, loss[loss=0.2429, simple_loss=0.3077, pruned_loss=0.08907, over 21711.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3306, pruned_loss=0.09977, over 4280414.76 frames. ], batch size: 263, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:56:11,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=423594.0, ans=0.125 2023-06-19 21:56:32,197 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2023-06-19 21:56:53,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=423654.0, ans=0.125 2023-06-19 21:57:31,852 INFO [train.py:996] (0/4) Epoch 3, batch 9650, loss[loss=0.2595, simple_loss=0.3293, pruned_loss=0.09487, over 21741.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3302, pruned_loss=0.09838, over 4284966.73 frames. ], batch size: 298, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 21:58:00,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423894.0, ans=0.1 2023-06-19 21:59:07,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.684e+02 3.158e+02 3.769e+02 7.574e+02, threshold=6.315e+02, percent-clipped=3.0 2023-06-19 21:59:08,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=424014.0, ans=10.0 2023-06-19 21:59:53,770 INFO [train.py:996] (0/4) Epoch 3, batch 9700, loss[loss=0.2523, simple_loss=0.3342, pruned_loss=0.08517, over 21864.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.333, pruned_loss=0.09904, over 4282932.10 frames. 
], batch size: 371, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:00:22,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=424194.0, ans=0.125 2023-06-19 22:00:41,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=424254.0, ans=0.125 2023-06-19 22:00:42,498 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:01:08,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=22.5 2023-06-19 22:01:38,962 INFO [train.py:996] (0/4) Epoch 3, batch 9750, loss[loss=0.2561, simple_loss=0.3084, pruned_loss=0.1019, over 21766.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3287, pruned_loss=0.09784, over 4282729.12 frames. ], batch size: 351, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:01:57,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=424434.0, ans=0.2 2023-06-19 22:02:13,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.54 vs. limit=15.0 2023-06-19 22:02:16,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=424494.0, ans=0.0 2023-06-19 22:02:26,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=424554.0, ans=0.2 2023-06-19 22:02:44,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.697e+02 3.201e+02 3.859e+02 6.121e+02, threshold=6.401e+02, percent-clipped=0.0 2023-06-19 22:02:53,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=424674.0, ans=0.2 2023-06-19 22:03:13,841 INFO [train.py:996] (0/4) Epoch 3, batch 9800, loss[loss=0.2786, simple_loss=0.3379, pruned_loss=0.1096, over 21725.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3282, pruned_loss=0.09803, over 4281383.81 frames. ], batch size: 389, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:04:06,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=424854.0, ans=0.0 2023-06-19 22:04:45,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=424974.0, ans=0.125 2023-06-19 22:04:47,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=424974.0, ans=0.2 2023-06-19 22:04:53,075 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:04:59,793 INFO [train.py:996] (0/4) Epoch 3, batch 9850, loss[loss=0.2205, simple_loss=0.2737, pruned_loss=0.08364, over 21311.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3248, pruned_loss=0.09802, over 4277011.75 frames. ], batch size: 548, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:05:15,192 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.00 vs. limit=15.0 2023-06-19 22:05:53,282 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. 
limit=15.0 2023-06-19 22:06:05,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=425154.0, ans=0.125 2023-06-19 22:06:23,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.545e+02 2.826e+02 3.275e+02 4.693e+02, threshold=5.651e+02, percent-clipped=0.0 2023-06-19 22:06:27,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=425214.0, ans=0.125 2023-06-19 22:06:29,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=425214.0, ans=0.0 2023-06-19 22:07:09,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=425274.0, ans=0.2 2023-06-19 22:07:14,426 INFO [train.py:996] (0/4) Epoch 3, batch 9900, loss[loss=0.3089, simple_loss=0.361, pruned_loss=0.1284, over 21371.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3204, pruned_loss=0.09707, over 4265914.82 frames. ], batch size: 471, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:07:50,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=425394.0, ans=0.04949747468305833 2023-06-19 22:07:58,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=425454.0, ans=0.125 2023-06-19 22:08:30,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=425574.0, ans=0.0 2023-06-19 22:08:51,745 INFO [train.py:996] (0/4) Epoch 3, batch 9950, loss[loss=0.2377, simple_loss=0.3056, pruned_loss=0.08489, over 21799.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3213, pruned_loss=0.09862, over 4257465.48 frames. ], batch size: 118, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:09:18,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=425694.0, ans=0.0 2023-06-19 22:09:41,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-19 22:10:03,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=425814.0, ans=0.2 2023-06-19 22:10:11,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.786e+02 3.203e+02 4.575e+02 9.878e+02, threshold=6.406e+02, percent-clipped=15.0 2023-06-19 22:10:54,983 INFO [train.py:996] (0/4) Epoch 3, batch 10000, loss[loss=0.2576, simple_loss=0.322, pruned_loss=0.09662, over 20716.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3167, pruned_loss=0.09711, over 4254669.60 frames. 
], batch size: 608, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:11:29,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=425994.0, ans=0.125 2023-06-19 22:11:43,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=426054.0, ans=10.0 2023-06-19 22:11:55,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=426054.0, ans=0.0 2023-06-19 22:13:05,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=426234.0, ans=0.1 2023-06-19 22:13:06,296 INFO [train.py:996] (0/4) Epoch 3, batch 10050, loss[loss=0.1918, simple_loss=0.2741, pruned_loss=0.05479, over 21428.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3199, pruned_loss=0.09802, over 4259895.49 frames. ], batch size: 211, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:13:21,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=426234.0, ans=0.0 2023-06-19 22:13:39,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=426294.0, ans=0.125 2023-06-19 22:14:06,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=426354.0, ans=0.025 2023-06-19 22:14:22,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.478e+02 2.837e+02 3.285e+02 4.855e+02, threshold=5.673e+02, percent-clipped=0.0 2023-06-19 22:15:05,170 INFO [train.py:996] (0/4) Epoch 3, batch 10100, loss[loss=0.3032, simple_loss=0.3548, pruned_loss=0.1258, over 21388.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3167, pruned_loss=0.09482, over 4263876.03 frames. ], batch size: 131, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:17:13,930 INFO [train.py:996] (0/4) Epoch 3, batch 10150, loss[loss=0.2723, simple_loss=0.3479, pruned_loss=0.0984, over 21407.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3256, pruned_loss=0.09821, over 4260691.55 frames. ], batch size: 143, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:17:15,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=426834.0, ans=0.125 2023-06-19 22:17:18,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=426834.0, ans=0.125 2023-06-19 22:17:37,449 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.64 vs. limit=6.0 2023-06-19 22:18:08,981 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. 
limit=10.0 2023-06-19 22:18:15,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.616e+02 3.111e+02 3.928e+02 6.144e+02, threshold=6.222e+02, percent-clipped=2.0 2023-06-19 22:18:28,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=427014.0, ans=0.125 2023-06-19 22:18:41,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=427074.0, ans=0.125 2023-06-19 22:18:45,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=427074.0, ans=0.2 2023-06-19 22:18:54,599 INFO [train.py:996] (0/4) Epoch 3, batch 10200, loss[loss=0.3202, simple_loss=0.3743, pruned_loss=0.1331, over 21374.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3238, pruned_loss=0.09567, over 4267160.14 frames. ], batch size: 507, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:19:14,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=427194.0, ans=0.1 2023-06-19 22:19:17,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=427194.0, ans=0.025 2023-06-19 22:19:46,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=427254.0, ans=0.0 2023-06-19 22:20:43,641 INFO [train.py:996] (0/4) Epoch 3, batch 10250, loss[loss=0.2629, simple_loss=0.3383, pruned_loss=0.09377, over 21587.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.316, pruned_loss=0.0884, over 4274734.28 frames. ], batch size: 389, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:20:57,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=427494.0, ans=0.125 2023-06-19 22:21:57,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.458e+02 2.783e+02 3.278e+02 6.025e+02, threshold=5.566e+02, percent-clipped=0.0 2023-06-19 22:22:34,959 INFO [train.py:996] (0/4) Epoch 3, batch 10300, loss[loss=0.2864, simple_loss=0.3449, pruned_loss=0.1139, over 21345.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3198, pruned_loss=0.09026, over 4277337.15 frames. ], batch size: 549, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:22:43,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=427734.0, ans=0.125 2023-06-19 22:22:49,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=427794.0, ans=0.0 2023-06-19 22:23:13,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=427794.0, ans=0.0 2023-06-19 22:23:16,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=427854.0, ans=0.125 2023-06-19 22:23:19,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. 
limit=15.0 2023-06-19 22:23:37,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=427854.0, ans=0.125 2023-06-19 22:23:58,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-19 22:24:22,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=427974.0, ans=0.125 2023-06-19 22:24:35,502 INFO [train.py:996] (0/4) Epoch 3, batch 10350, loss[loss=0.277, simple_loss=0.3481, pruned_loss=0.1029, over 21564.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3223, pruned_loss=0.09047, over 4279841.00 frames. ], batch size: 441, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 22:24:38,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=428034.0, ans=0.1 2023-06-19 22:24:53,279 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-19 22:26:05,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.731e+02 3.156e+02 3.865e+02 6.319e+02, threshold=6.313e+02, percent-clipped=4.0 2023-06-19 22:26:10,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=428214.0, ans=0.0 2023-06-19 22:26:38,496 INFO [train.py:996] (0/4) Epoch 3, batch 10400, loss[loss=0.2137, simple_loss=0.2763, pruned_loss=0.07556, over 21787.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3189, pruned_loss=0.08923, over 4273871.18 frames. ], batch size: 282, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:26:49,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428334.0, ans=0.1 2023-06-19 22:27:09,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=428394.0, ans=0.125 2023-06-19 22:27:19,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=428394.0, ans=0.1 2023-06-19 22:27:58,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.24 vs. limit=10.0 2023-06-19 22:28:03,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-19 22:28:07,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=428514.0, ans=0.125 2023-06-19 22:28:45,451 INFO [train.py:996] (0/4) Epoch 3, batch 10450, loss[loss=0.2697, simple_loss=0.3301, pruned_loss=0.1046, over 21181.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3243, pruned_loss=0.0929, over 4273724.79 frames. 
], batch size: 159, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:29:28,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=428694.0, ans=0.2 2023-06-19 22:30:01,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=428754.0, ans=0.125 2023-06-19 22:30:16,369 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.823e+02 3.547e+02 4.478e+02 9.217e+02, threshold=7.094e+02, percent-clipped=7.0 2023-06-19 22:30:55,063 INFO [train.py:996] (0/4) Epoch 3, batch 10500, loss[loss=0.2143, simple_loss=0.3121, pruned_loss=0.05824, over 19748.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3238, pruned_loss=0.09228, over 4274305.60 frames. ], batch size: 702, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:31:06,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=428934.0, ans=0.0 2023-06-19 22:31:37,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-19 22:31:46,878 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=22.5 2023-06-19 22:32:12,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=429114.0, ans=0.0 2023-06-19 22:32:15,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=429174.0, ans=0.2 2023-06-19 22:32:45,447 INFO [train.py:996] (0/4) Epoch 3, batch 10550, loss[loss=0.2654, simple_loss=0.3617, pruned_loss=0.08454, over 20860.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3168, pruned_loss=0.09149, over 4267229.47 frames. ], batch size: 608, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:33:03,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=429234.0, ans=0.1 2023-06-19 22:33:04,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=429234.0, ans=0.1 2023-06-19 22:33:10,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0 2023-06-19 22:33:12,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=15.0 2023-06-19 22:34:03,944 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.386e+02 2.842e+02 3.360e+02 5.942e+02, threshold=5.684e+02, percent-clipped=0.0 2023-06-19 22:34:49,234 INFO [train.py:996] (0/4) Epoch 3, batch 10600, loss[loss=0.2287, simple_loss=0.3179, pruned_loss=0.06975, over 21764.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3118, pruned_loss=0.08964, over 4260051.45 frames. 
], batch size: 351, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 22:34:49,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=429534.0, ans=0.2 2023-06-19 22:35:02,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=429534.0, ans=0.125 2023-06-19 22:36:21,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=429774.0, ans=0.125 2023-06-19 22:36:52,424 INFO [train.py:996] (0/4) Epoch 3, batch 10650, loss[loss=0.1819, simple_loss=0.248, pruned_loss=0.05793, over 21307.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3157, pruned_loss=0.08908, over 4267696.65 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:37:17,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=429894.0, ans=0.2 2023-06-19 22:37:22,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=15.0 2023-06-19 22:38:19,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.760e+02 3.791e+02 4.826e+02 9.694e+02, threshold=7.582e+02, percent-clipped=13.0 2023-06-19 22:38:35,491 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-19 22:38:52,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.81 vs. limit=10.0 2023-06-19 22:38:56,295 INFO [train.py:996] (0/4) Epoch 3, batch 10700, loss[loss=0.2861, simple_loss=0.3331, pruned_loss=0.1195, over 20038.00 frames. ], tot_loss[loss=0.246, simple_loss=0.314, pruned_loss=0.08904, over 4253057.32 frames. ], batch size: 702, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:39:14,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=430134.0, ans=10.0 2023-06-19 22:39:23,797 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-19 22:39:39,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=430254.0, ans=0.0 2023-06-19 22:41:02,127 INFO [train.py:996] (0/4) Epoch 3, batch 10750, loss[loss=0.3241, simple_loss=0.4054, pruned_loss=0.1215, over 21750.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3241, pruned_loss=0.0939, over 4258115.23 frames. ], batch size: 441, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:41:09,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. 
limit=10.0 2023-06-19 22:41:43,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=430494.0, ans=0.0 2023-06-19 22:41:50,721 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:42:11,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=430614.0, ans=0.125 2023-06-19 22:42:18,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=430614.0, ans=0.0 2023-06-19 22:42:19,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=430614.0, ans=0.125 2023-06-19 22:42:22,447 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.840e+02 3.445e+02 4.011e+02 6.659e+02, threshold=6.891e+02, percent-clipped=0.0 2023-06-19 22:42:27,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=430614.0, ans=0.1 2023-06-19 22:42:38,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=10.0 2023-06-19 22:42:49,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=430674.0, ans=0.125 2023-06-19 22:43:03,920 INFO [train.py:996] (0/4) Epoch 3, batch 10800, loss[loss=0.283, simple_loss=0.3514, pruned_loss=0.1073, over 21361.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3304, pruned_loss=0.09562, over 4265944.92 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:45:01,154 INFO [train.py:996] (0/4) Epoch 3, batch 10850, loss[loss=0.2304, simple_loss=0.2914, pruned_loss=0.08473, over 21567.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3313, pruned_loss=0.09596, over 4260525.93 frames. ], batch size: 263, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:45:11,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431034.0, ans=0.1 2023-06-19 22:46:22,970 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.020e+02 2.541e+02 3.011e+02 3.427e+02 5.318e+02, threshold=6.021e+02, percent-clipped=0.0 2023-06-19 22:46:31,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-19 22:46:45,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=431334.0, ans=0.2 2023-06-19 22:46:46,204 INFO [train.py:996] (0/4) Epoch 3, batch 10900, loss[loss=0.2398, simple_loss=0.3278, pruned_loss=0.07586, over 21718.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3244, pruned_loss=0.09372, over 4256989.21 frames. ], batch size: 282, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:46:50,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. 
limit=15.0 2023-06-19 22:46:53,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=431334.0, ans=0.0 2023-06-19 22:48:28,156 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.29 vs. limit=22.5 2023-06-19 22:48:33,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=431574.0, ans=0.125 2023-06-19 22:48:42,865 INFO [train.py:996] (0/4) Epoch 3, batch 10950, loss[loss=0.2421, simple_loss=0.2999, pruned_loss=0.09213, over 21251.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.319, pruned_loss=0.09102, over 4255847.91 frames. ], batch size: 144, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:49:05,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431634.0, ans=0.1 2023-06-19 22:49:06,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-19 22:49:14,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=431694.0, ans=0.0 2023-06-19 22:49:25,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431694.0, ans=0.1 2023-06-19 22:49:42,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=431754.0, ans=0.1 2023-06-19 22:50:08,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.20 vs. limit=6.0 2023-06-19 22:50:09,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.597e+02 3.255e+02 3.821e+02 6.519e+02, threshold=6.510e+02, percent-clipped=2.0 2023-06-19 22:50:16,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431874.0, ans=0.1 2023-06-19 22:50:45,267 INFO [train.py:996] (0/4) Epoch 3, batch 11000, loss[loss=0.2504, simple_loss=0.3097, pruned_loss=0.09551, over 21701.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3186, pruned_loss=0.09201, over 4249939.21 frames. ], batch size: 230, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:51:08,860 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-72000.pt 2023-06-19 22:51:37,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=432054.0, ans=0.2 2023-06-19 22:51:43,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=432054.0, ans=0.125 2023-06-19 22:51:45,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432054.0, ans=0.1 2023-06-19 22:52:12,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=432114.0, ans=0.0 2023-06-19 22:52:15,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.61 vs. 
limit=15.0 2023-06-19 22:52:26,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=432174.0, ans=0.125 2023-06-19 22:52:33,734 INFO [train.py:996] (0/4) Epoch 3, batch 11050, loss[loss=0.2184, simple_loss=0.2745, pruned_loss=0.08116, over 21574.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3173, pruned_loss=0.09389, over 4240127.00 frames. ], batch size: 263, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:53:46,622 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.920e+02 3.396e+02 4.201e+02 8.716e+02, threshold=6.791e+02, percent-clipped=3.0 2023-06-19 22:53:53,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=432414.0, ans=0.0 2023-06-19 22:53:55,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=432474.0, ans=0.125 2023-06-19 22:54:02,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=432474.0, ans=0.2 2023-06-19 22:54:10,306 INFO [train.py:996] (0/4) Epoch 3, batch 11100, loss[loss=0.3016, simple_loss=0.351, pruned_loss=0.1261, over 21594.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3165, pruned_loss=0.09462, over 4234666.09 frames. ], batch size: 414, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:54:21,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.14 vs. limit=10.0 2023-06-19 22:54:52,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=432594.0, ans=0.0 2023-06-19 22:55:55,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=432714.0, ans=0.2 2023-06-19 22:56:06,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=432774.0, ans=0.125 2023-06-19 22:56:13,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=432834.0, ans=0.125 2023-06-19 22:56:14,883 INFO [train.py:996] (0/4) Epoch 3, batch 11150, loss[loss=0.2323, simple_loss=0.3243, pruned_loss=0.07018, over 21701.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3144, pruned_loss=0.09423, over 4238575.69 frames. ], batch size: 332, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:57:15,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432954.0, ans=0.1 2023-06-19 22:57:40,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-19 22:57:41,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.651e+02 2.885e+02 3.515e+02 7.934e+02, threshold=5.769e+02, percent-clipped=2.0 2023-06-19 22:58:09,301 INFO [train.py:996] (0/4) Epoch 3, batch 11200, loss[loss=0.23, simple_loss=0.2905, pruned_loss=0.08475, over 21660.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3128, pruned_loss=0.09393, over 4249770.82 frames. 
], batch size: 333, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 22:58:30,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=433194.0, ans=0.1 2023-06-19 22:58:50,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-19 22:59:09,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=433254.0, ans=0.125 2023-06-19 23:00:02,714 INFO [train.py:996] (0/4) Epoch 3, batch 11250, loss[loss=0.2585, simple_loss=0.3412, pruned_loss=0.08795, over 21800.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3115, pruned_loss=0.09325, over 4256754.73 frames. ], batch size: 118, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:00:31,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=433494.0, ans=0.2 2023-06-19 23:00:34,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=433494.0, ans=0.0 2023-06-19 23:01:06,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=433554.0, ans=0.0 2023-06-19 23:01:40,271 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.431e+02 2.742e+02 3.352e+02 9.461e+02, threshold=5.484e+02, percent-clipped=5.0 2023-06-19 23:02:04,707 INFO [train.py:996] (0/4) Epoch 3, batch 11300, loss[loss=0.2479, simple_loss=0.3114, pruned_loss=0.09225, over 21304.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3125, pruned_loss=0.09321, over 4265158.93 frames. ], batch size: 159, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:02:06,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=433734.0, ans=0.125 2023-06-19 23:02:09,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=433734.0, ans=0.125 2023-06-19 23:02:13,117 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-19 23:02:50,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=433854.0, ans=0.0 2023-06-19 23:02:52,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=433854.0, ans=0.1 2023-06-19 23:04:00,913 INFO [train.py:996] (0/4) Epoch 3, batch 11350, loss[loss=0.2306, simple_loss=0.3127, pruned_loss=0.07423, over 21631.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3147, pruned_loss=0.09212, over 4269130.45 frames. 
], batch size: 263, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:04:25,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=434094.0, ans=0.0 2023-06-19 23:05:25,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.877e+02 3.365e+02 4.461e+02 8.303e+02, threshold=6.730e+02, percent-clipped=12.0 2023-06-19 23:05:52,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=434274.0, ans=0.125 2023-06-19 23:06:03,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=434274.0, ans=15.0 2023-06-19 23:06:04,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=434334.0, ans=0.2 2023-06-19 23:06:05,029 INFO [train.py:996] (0/4) Epoch 3, batch 11400, loss[loss=0.2387, simple_loss=0.2929, pruned_loss=0.09224, over 21257.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3189, pruned_loss=0.09396, over 4275729.54 frames. ], batch size: 608, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:07:08,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=434454.0, ans=0.2 2023-06-19 23:07:23,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=434514.0, ans=0.0 2023-06-19 23:07:30,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=434514.0, ans=0.125 2023-06-19 23:08:04,088 INFO [train.py:996] (0/4) Epoch 3, batch 11450, loss[loss=0.1995, simple_loss=0.2782, pruned_loss=0.06041, over 21277.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3203, pruned_loss=0.09243, over 4281090.98 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:09:31,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=434814.0, ans=0.0 2023-06-19 23:09:32,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.596e+02 3.109e+02 3.660e+02 5.880e+02, threshold=6.218e+02, percent-clipped=0.0 2023-06-19 23:10:05,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=434874.0, ans=0.125 2023-06-19 23:10:08,102 INFO [train.py:996] (0/4) Epoch 3, batch 11500, loss[loss=0.2227, simple_loss=0.3111, pruned_loss=0.06714, over 21857.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3242, pruned_loss=0.0939, over 4280241.35 frames. ], batch size: 316, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 23:10:34,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=434934.0, ans=0.125 2023-06-19 23:10:45,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=434994.0, ans=0.125 2023-06-19 23:11:01,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=434994.0, ans=0.04949747468305833 2023-06-19 23:12:09,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=435174.0, ans=0.125 2023-06-19 23:12:15,585 INFO [train.py:996] (0/4) Epoch 3, batch 11550, loss[loss=0.2496, simple_loss=0.3228, pruned_loss=0.08817, over 19966.00 frames. 
2023-06-19 23:12:19,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=435234.0, ans=0.2
2023-06-19 23:12:20,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=435234.0, ans=0.0
2023-06-19 23:13:54,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.665e+02 3.265e+02 4.118e+02 8.231e+02, threshold=6.531e+02, percent-clipped=5.0
2023-06-19 23:14:19,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=435474.0, ans=0.0
2023-06-19 23:14:29,451 INFO [train.py:996] (0/4) Epoch 3, batch 11600, loss[loss=0.2835, simple_loss=0.3805, pruned_loss=0.09322, over 21674.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3437, pruned_loss=0.09528, over 4269859.31 frames. ], batch size: 247, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:14:33,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=435534.0, ans=0.1
2023-06-19 23:14:33,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=435534.0, ans=0.09899494936611666
2023-06-19 23:14:33,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=435534.0, ans=0.125
2023-06-19 23:14:37,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.03 vs. limit=8.0
2023-06-19 23:15:14,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=435654.0, ans=10.0
2023-06-19 23:15:24,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=435714.0, ans=0.0
2023-06-19 23:16:14,738 INFO [train.py:996] (0/4) Epoch 3, batch 11650, loss[loss=0.2613, simple_loss=0.3338, pruned_loss=0.09444, over 21561.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3498, pruned_loss=0.09667, over 4272534.72 frames. ], batch size: 230, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:17:47,550 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.535e+02 2.971e+02 3.665e+02 6.378e+02, threshold=5.943e+02, percent-clipped=0.0
2023-06-19 23:17:49,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=436014.0, ans=0.125
2023-06-19 23:17:59,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=22.5
2023-06-19 23:18:02,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=436074.0, ans=0.0
2023-06-19 23:18:18,003 INFO [train.py:996] (0/4) Epoch 3, batch 11700, loss[loss=0.2354, simple_loss=0.2937, pruned_loss=0.08861, over 21618.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3407, pruned_loss=0.09559, over 4272331.44 frames. ], batch size: 332, lr: 1.12e-02, grad_scale: 32.0
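The [scaling.py:962] Whitening records compare a per-module statistic against a limit (metric=6.03 vs. limit=8.0 and so on). The metric reads like a measure of how far the activations' channel covariance is from a multiple of the identity, equal to 1.0 for perfectly "white" features and growing as channels become correlated or unevenly scaled. A rough, assumed reconstruction follows; the exact formula in icefall's scaling.py may differ:

```python
# Assumed whitening diagnostic: ratio of the mean squared eigenvalue
# of the channel covariance to the squared mean eigenvalue.
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels) activations
    x = x - x.mean(dim=0)
    cov = (x.t() @ x) / x.shape[0]      # (C, C) covariance
    eigs = torch.linalg.eigvalsh(cov)   # nonnegative eigenvalues
    return (eigs ** 2).mean() / eigs.mean() ** 2  # 1.0 if "white"

x = torch.randn(1000, 256) @ torch.randn(256, 256)  # correlated channels
print(float(whitening_metric(x)))  # large; whitened input would give ~1.0
```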
2023-06-19 23:18:21,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=436134.0, ans=0.09899494936611666
2023-06-19 23:19:19,854 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0
2023-06-19 23:19:34,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=436374.0, ans=0.125
2023-06-19 23:19:46,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=436374.0, ans=15.0
2023-06-19 23:19:50,332 INFO [train.py:996] (0/4) Epoch 3, batch 11750, loss[loss=0.2733, simple_loss=0.3422, pruned_loss=0.1022, over 21416.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3305, pruned_loss=0.09483, over 4275157.10 frames. ], batch size: 131, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:20:20,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=436494.0, ans=0.07
2023-06-19 23:20:31,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=436554.0, ans=0.0
2023-06-19 23:20:36,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=436554.0, ans=0.125
2023-06-19 23:20:53,730 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.589e+02 3.123e+02 3.553e+02 5.230e+02, threshold=6.245e+02, percent-clipped=0.0
2023-06-19 23:21:20,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=436674.0, ans=0.125
2023-06-19 23:21:23,186 INFO [train.py:996] (0/4) Epoch 3, batch 11800, loss[loss=0.229, simple_loss=0.3205, pruned_loss=0.06873, over 21380.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3344, pruned_loss=0.09908, over 4280426.14 frames. ], batch size: 211, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:21:36,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=436734.0, ans=0.0
2023-06-19 23:23:16,613 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. limit=10.0
2023-06-19 23:23:33,153 INFO [train.py:996] (0/4) Epoch 3, batch 11850, loss[loss=0.2698, simple_loss=0.363, pruned_loss=0.08829, over 20789.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3353, pruned_loss=0.09719, over 4288519.75 frames. ], batch size: 607, lr: 1.12e-02, grad_scale: 32.0
2023-06-19 23:23:41,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0
2023-06-19 23:24:54,513 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.604e+02 3.025e+02 3.520e+02 5.085e+02, threshold=6.049e+02, percent-clipped=0.0
2023-06-19 23:25:30,665 INFO [train.py:996] (0/4) Epoch 3, batch 11900, loss[loss=0.2148, simple_loss=0.3011, pruned_loss=0.06427, over 21751.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3333, pruned_loss=0.09415, over 4289625.13 frames. ], batch size: 282, lr: 1.12e-02, grad_scale: 32.0
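The train.py loss records are internally consistent with the total being a weighted sum of the two components: in the batch 11750 record above, 0.5 * 0.3422 + 0.1022 = 0.2733, and the same 0.5 weighting of simple_loss fits the other lines, while tot_loss[...] behaves like a frame-weighted average over the recent window of batches. A hypothetical reconstruction of that bookkeeping, not the recipe's actual code:

```python
# Combine the two transducer loss terms and keep the frame-weighted
# running average that a tot_loss[...] record reports.
def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    return simple_loss_scale * simple_loss + pruned_loss

class RunningLoss:
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss, num_frames):
        self.loss_sum += loss * num_frames
        self.frames += num_frames

    @property
    def avg(self):
        return self.loss_sum / max(self.frames, 1.0)

tot = RunningLoss()
tot.update(combined_loss(0.3422, 0.1022), 21416.0)
print(round(tot.avg, 4))  # 0.2733 for this single batch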
2023-06-19 23:26:26,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=437454.0, ans=0.2
2023-06-19 23:26:37,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=437514.0, ans=0.0
2023-06-19 23:26:37,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=437514.0, ans=0.05
2023-06-19 23:26:45,373 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0
2023-06-19 23:26:46,715 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0
2023-06-19 23:26:47,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=437574.0, ans=0.0
2023-06-19 23:27:05,528 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.69 vs. limit=22.5
2023-06-19 23:27:08,736 INFO [train.py:996] (0/4) Epoch 3, batch 11950, loss[loss=0.2209, simple_loss=0.286, pruned_loss=0.07792, over 21804.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3323, pruned_loss=0.09049, over 4282447.25 frames. ], batch size: 102, lr: 1.11e-02, grad_scale: 32.0
2023-06-19 23:28:28,551 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.557e+02 3.205e+02 3.971e+02 7.967e+02, threshold=6.411e+02, percent-clipped=3.0
2023-06-19 23:28:58,288 INFO [train.py:996] (0/4) Epoch 3, batch 12000, loss[loss=0.2374, simple_loss=0.2977, pruned_loss=0.08857, over 21842.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3252, pruned_loss=0.08842, over 4275895.11 frames. ], batch size: 318, lr: 1.11e-02, grad_scale: 32.0
2023-06-19 23:28:58,289 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-19 23:29:56,189 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2725, simple_loss=0.3684, pruned_loss=0.08831, over 1796401.00 frames.
2023-06-19 23:29:56,191 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB
2023-06-19 23:30:30,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=437994.0, ans=0.125
2023-06-19 23:30:30,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=12.0
2023-06-19 23:31:14,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=438114.0, ans=0.125
2023-06-19 23:31:44,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0
2023-06-19 23:31:52,700 INFO [train.py:996] (0/4) Epoch 3, batch 12050, loss[loss=0.2566, simple_loss=0.3135, pruned_loss=0.09983, over 21893.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3237, pruned_loss=0.0905, over 4273143.07 frames. ], batch size: 351, lr: 1.11e-02, grad_scale: 32.0
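The "Computing validation loss" / "Epoch 3, validation: ..." pair above shows a periodic dev-set pass (here over 1796401.00 frames) followed by a peak-memory report. A generic sketch of that step is below; model_forward stands in for whatever per-batch loss helper the recipe actually uses and is an assumption:

```python
# Sketch: evaluate on the dev loader without gradients and report a
# frame-weighted average loss, as the validation records do.
import torch

def model_forward(model, batch, device):
    # placeholder for the recipe's per-batch loss computation;
    # should return (loss tensor, number of frames in the batch)
    raise NotImplementedError

def compute_validation_loss(model, valid_loader, device):
    model.eval()
    loss_sum, frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            loss, num_frames = model_forward(model, batch, device)
            loss_sum += loss.item() * num_frames
            frames += num_frames
    model.train()
    return loss_sum / max(frames, 1.0)

def max_memory_mb():
    # peak GPU memory, as in the "Maximum memory allocated" records
    return torch.cuda.max_memory_allocated() // (1024 * 1024)
```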
2023-06-19 23:33:06,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.716e+02 3.222e+02 3.934e+02 6.549e+02, threshold=6.444e+02, percent-clipped=1.0
2023-06-19 23:33:06,638 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 23:33:45,636 INFO [train.py:996] (0/4) Epoch 3, batch 12100, loss[loss=0.2349, simple_loss=0.2865, pruned_loss=0.09162, over 20767.00 frames. ], tot_loss[loss=0.26, simple_loss=0.33, pruned_loss=0.09498, over 4273071.13 frames. ], batch size: 607, lr: 1.11e-02, grad_scale: 32.0
2023-06-19 23:35:03,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=438714.0, ans=0.125
2023-06-19 23:35:11,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=438714.0, ans=0.125
2023-06-19 23:35:23,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=438714.0, ans=0.0
2023-06-19 23:36:08,829 INFO [train.py:996] (0/4) Epoch 3, batch 12150, loss[loss=0.2826, simple_loss=0.335, pruned_loss=0.1151, over 20662.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3358, pruned_loss=0.09589, over 4276749.99 frames. ], batch size: 607, lr: 1.11e-02, grad_scale: 16.0
2023-06-19 23:36:09,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=438834.0, ans=0.2
2023-06-19 23:36:30,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=438894.0, ans=0.2
2023-06-19 23:37:37,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0
2023-06-19 23:37:37,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.132e+02 3.694e+02 4.434e+02 6.753e+02, threshold=7.387e+02, percent-clipped=1.0
2023-06-19 23:37:59,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=439074.0, ans=0.5
2023-06-19 23:38:11,576 INFO [train.py:996] (0/4) Epoch 3, batch 12200, loss[loss=0.2215, simple_loss=0.2796, pruned_loss=0.08166, over 21680.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.332, pruned_loss=0.09408, over 4273055.51 frames. ], batch size: 299, lr: 1.11e-02, grad_scale: 16.0
2023-06-19 23:38:18,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=439134.0, ans=10.0
2023-06-19 23:39:01,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=439314.0, ans=0.1
2023-06-19 23:39:37,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0
2023-06-19 23:40:09,931 INFO [train.py:996] (0/4) Epoch 3, batch 12250, loss[loss=0.1871, simple_loss=0.2648, pruned_loss=0.05467, over 21556.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3229, pruned_loss=0.09042, over 4277511.10 frames. ], batch size: 212, lr: 1.11e-02, grad_scale: 16.0
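grad_scale in these records sits at 32.0, drops to 16.0 around batch 12150, and returns to 32.0 a little later. That pattern matches standard fp16 loss scaling, where the scale is halved when a step overflows and grown back after a run of clean steps. A generic PyTorch AMP loop showing the same mechanics, with the surrounding training code assumed:

```python
# Generic fp16 loss-scaling loop; GradScaler halves its scale on an
# inf/nan gradient and doubles it after growth_interval good steps.
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=32.0, growth_interval=2000)

def training_step(model, optimizer, batch_fn):
    optimizer.zero_grad()
    with autocast():
        loss = batch_fn(model)       # forward in fp16 where safe
    scaler.scale(loss).backward()    # scaled backward pass
    scaler.step(optimizer)           # skips the update on overflow
    scaler.update()                  # adjusts the scale
    return scaler.get_scale()        # the "grad_scale" being logged
```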
2023-06-19 23:40:11,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=439434.0, ans=0.0
2023-06-19 23:40:34,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=439494.0, ans=0.2
2023-06-19 23:40:57,060 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 23:41:12,265 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 2.432e+02 2.824e+02 3.455e+02 5.989e+02, threshold=5.649e+02, percent-clipped=0.0
2023-06-19 23:41:34,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=439674.0, ans=0.2
2023-06-19 23:41:39,431 INFO [train.py:996] (0/4) Epoch 3, batch 12300, loss[loss=0.2231, simple_loss=0.3111, pruned_loss=0.06756, over 21740.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3146, pruned_loss=0.08361, over 4279528.83 frames. ], batch size: 332, lr: 1.11e-02, grad_scale: 16.0
2023-06-19 23:42:29,818 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 23:43:29,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=439974.0, ans=0.0
2023-06-19 23:43:37,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0
2023-06-19 23:43:45,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439974.0, ans=0.1
2023-06-19 23:43:49,785 INFO [train.py:996] (0/4) Epoch 3, batch 12350, loss[loss=0.2762, simple_loss=0.3618, pruned_loss=0.09537, over 21338.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3197, pruned_loss=0.08469, over 4279794.52 frames. ], batch size: 548, lr: 1.11e-02, grad_scale: 16.0
2023-06-19 23:44:07,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=440034.0, ans=0.1
2023-06-19 23:44:19,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0
2023-06-19 23:44:32,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=440154.0, ans=0.2
2023-06-19 23:44:37,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=440154.0, ans=0.5
2023-06-19 23:44:42,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=12.0
2023-06-19 23:44:51,642 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 2.628e+02 3.014e+02 3.788e+02 7.072e+02, threshold=6.028e+02, percent-clipped=4.0
2023-06-19 23:45:31,938 INFO [train.py:996] (0/4) Epoch 3, batch 12400, loss[loss=0.2458, simple_loss=0.3155, pruned_loss=0.08806, over 21549.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3221, pruned_loss=0.08938, over 4281954.35 frames.
], batch size: 131, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 23:46:08,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-19 23:46:13,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=440394.0, ans=0.0 2023-06-19 23:46:47,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=440454.0, ans=0.025 2023-06-19 23:46:55,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=440514.0, ans=0.125 2023-06-19 23:47:08,790 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-19 23:47:42,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.64 vs. limit=15.0 2023-06-19 23:47:47,501 INFO [train.py:996] (0/4) Epoch 3, batch 12450, loss[loss=0.2935, simple_loss=0.3542, pruned_loss=0.1164, over 21472.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3262, pruned_loss=0.0933, over 4287706.06 frames. ], batch size: 211, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 23:48:01,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=440634.0, ans=0.1 2023-06-19 23:48:17,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=440694.0, ans=0.04949747468305833 2023-06-19 23:48:17,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=440694.0, ans=0.2 2023-06-19 23:48:18,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=440694.0, ans=0.0 2023-06-19 23:48:33,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=440754.0, ans=0.025 2023-06-19 23:48:39,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=440754.0, ans=0.125 2023-06-19 23:48:57,896 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.789e+02 3.094e+02 3.461e+02 5.452e+02, threshold=6.188e+02, percent-clipped=0.0 2023-06-19 23:49:31,673 INFO [train.py:996] (0/4) Epoch 3, batch 12500, loss[loss=0.3064, simple_loss=0.3805, pruned_loss=0.1162, over 21368.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3388, pruned_loss=0.09778, over 4282864.11 frames. 
], batch size: 159, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 23:49:32,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=440934.0, ans=0.125 2023-06-19 23:49:34,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=440934.0, ans=0.0 2023-06-19 23:50:17,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=440994.0, ans=0.0 2023-06-19 23:50:24,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=440994.0, ans=0.1 2023-06-19 23:50:46,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=12.0 2023-06-19 23:50:47,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-19 23:51:43,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=441174.0, ans=0.04949747468305833 2023-06-19 23:51:48,447 INFO [train.py:996] (0/4) Epoch 3, batch 12550, loss[loss=0.2686, simple_loss=0.343, pruned_loss=0.09706, over 21976.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.343, pruned_loss=0.1003, over 4278658.79 frames. ], batch size: 317, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:52:04,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=441234.0, ans=0.125 2023-06-19 23:52:29,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=441294.0, ans=0.2 2023-06-19 23:52:53,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=441354.0, ans=10.0 2023-06-19 23:53:03,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.56 vs. limit=15.0 2023-06-19 23:53:15,583 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.900e+02 3.527e+02 4.002e+02 7.299e+02, threshold=7.054e+02, percent-clipped=3.0 2023-06-19 23:53:56,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=441474.0, ans=0.125 2023-06-19 23:53:58,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=441474.0, ans=0.0 2023-06-19 23:54:00,537 INFO [train.py:996] (0/4) Epoch 3, batch 12600, loss[loss=0.202, simple_loss=0.2803, pruned_loss=0.06183, over 21510.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3422, pruned_loss=0.0978, over 4280130.32 frames. ], batch size: 212, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:54:48,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.52 vs. 
limit=15.0 2023-06-19 23:54:50,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=441654.0, ans=0.2 2023-06-19 23:55:20,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=441714.0, ans=0.125 2023-06-19 23:55:49,630 INFO [train.py:996] (0/4) Epoch 3, batch 12650, loss[loss=0.2612, simple_loss=0.3248, pruned_loss=0.09876, over 21849.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.331, pruned_loss=0.09232, over 4274469.22 frames. ], batch size: 391, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:55:50,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=441834.0, ans=0.0 2023-06-19 23:56:20,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=441894.0, ans=0.2 2023-06-19 23:56:24,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=441894.0, ans=0.125 2023-06-19 23:56:35,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=441954.0, ans=0.125 2023-06-19 23:56:40,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=441954.0, ans=0.0 2023-06-19 23:57:05,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 2.355e+02 2.861e+02 3.372e+02 5.218e+02, threshold=5.723e+02, percent-clipped=0.0 2023-06-19 23:57:08,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=442014.0, ans=0.125 2023-06-19 23:57:37,730 INFO [train.py:996] (0/4) Epoch 3, batch 12700, loss[loss=0.2817, simple_loss=0.3453, pruned_loss=0.109, over 21494.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3321, pruned_loss=0.0955, over 4279710.28 frames. ], batch size: 211, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 23:57:58,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-19 23:58:57,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=442314.0, ans=0.125 2023-06-19 23:59:00,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-19 23:59:03,124 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:59:35,040 INFO [train.py:996] (0/4) Epoch 3, batch 12750, loss[loss=0.2347, simple_loss=0.3175, pruned_loss=0.07593, over 21778.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3351, pruned_loss=0.09723, over 4275354.57 frames. 
], batch size: 282, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:00:50,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=442614.0, ans=0.0 2023-06-20 00:01:08,596 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.647e+02 3.084e+02 3.611e+02 5.455e+02, threshold=6.168e+02, percent-clipped=0.0 2023-06-20 00:01:11,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=15.0 2023-06-20 00:01:43,925 INFO [train.py:996] (0/4) Epoch 3, batch 12800, loss[loss=0.2453, simple_loss=0.3174, pruned_loss=0.08658, over 21641.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3344, pruned_loss=0.09784, over 4273859.05 frames. ], batch size: 263, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:01:46,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.66 vs. limit=6.0 2023-06-20 00:01:50,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=442734.0, ans=0.0 2023-06-20 00:02:54,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=442914.0, ans=0.07 2023-06-20 00:03:41,815 INFO [train.py:996] (0/4) Epoch 3, batch 12850, loss[loss=0.2861, simple_loss=0.3626, pruned_loss=0.1048, over 21749.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3362, pruned_loss=0.09935, over 4276484.52 frames. ], batch size: 441, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:04:21,955 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-20 00:04:24,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=443154.0, ans=0.025 2023-06-20 00:04:25,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=443154.0, ans=0.0 2023-06-20 00:05:19,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.657e+02 3.038e+02 3.599e+02 6.295e+02, threshold=6.075e+02, percent-clipped=1.0 2023-06-20 00:05:49,487 INFO [train.py:996] (0/4) Epoch 3, batch 12900, loss[loss=0.2818, simple_loss=0.3524, pruned_loss=0.1056, over 20664.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3337, pruned_loss=0.09561, over 4275840.60 frames. ], batch size: 607, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:06:45,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-20 00:07:13,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=443514.0, ans=0.1 2023-06-20 00:07:58,457 INFO [train.py:996] (0/4) Epoch 3, batch 12950, loss[loss=0.3014, simple_loss=0.3622, pruned_loss=0.1203, over 21453.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3315, pruned_loss=0.0938, over 4276548.69 frames. 
], batch size: 471, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:08:06,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=443634.0, ans=10.0 2023-06-20 00:08:10,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=443634.0, ans=0.2 2023-06-20 00:09:24,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.40 vs. limit=15.0 2023-06-20 00:09:30,309 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.524e+02 2.816e+02 3.145e+02 4.812e+02, threshold=5.631e+02, percent-clipped=0.0 2023-06-20 00:09:57,392 INFO [train.py:996] (0/4) Epoch 3, batch 13000, loss[loss=0.2376, simple_loss=0.3204, pruned_loss=0.07737, over 21583.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3305, pruned_loss=0.09314, over 4282115.55 frames. ], batch size: 441, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:11:05,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-20 00:12:05,457 INFO [train.py:996] (0/4) Epoch 3, batch 13050, loss[loss=0.2498, simple_loss=0.3147, pruned_loss=0.09239, over 21693.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3252, pruned_loss=0.09039, over 4284064.03 frames. ], batch size: 263, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:13:02,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=444414.0, ans=0.125 2023-06-20 00:13:02,589 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0 2023-06-20 00:13:15,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=444414.0, ans=0.0 2023-06-20 00:13:18,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 2.600e+02 3.229e+02 4.013e+02 6.675e+02, threshold=6.458e+02, percent-clipped=7.0 2023-06-20 00:13:48,112 INFO [train.py:996] (0/4) Epoch 3, batch 13100, loss[loss=0.2757, simple_loss=0.344, pruned_loss=0.1037, over 21360.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3269, pruned_loss=0.09105, over 4290462.70 frames. ], batch size: 159, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:13:57,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=444534.0, ans=0.125 2023-06-20 00:14:20,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=444594.0, ans=0.125 2023-06-20 00:14:44,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=444594.0, ans=0.2 2023-06-20 00:15:11,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=444714.0, ans=0.125 2023-06-20 00:15:24,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=444714.0, ans=0.07 2023-06-20 00:15:44,213 INFO [train.py:996] (0/4) Epoch 3, batch 13150, loss[loss=0.2431, simple_loss=0.3164, pruned_loss=0.08494, over 21742.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3274, pruned_loss=0.09342, over 4289796.82 frames. 
], batch size: 352, lr: 1.11e-02, grad_scale: 16.0 2023-06-20 00:16:00,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=444834.0, ans=0.0 2023-06-20 00:16:02,391 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-06-20 00:17:23,501 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.624e+02 3.071e+02 3.652e+02 5.306e+02, threshold=6.141e+02, percent-clipped=0.0 2023-06-20 00:17:48,429 INFO [train.py:996] (0/4) Epoch 3, batch 13200, loss[loss=0.2696, simple_loss=0.3331, pruned_loss=0.103, over 21661.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3253, pruned_loss=0.09288, over 4294073.31 frames. ], batch size: 230, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:17:49,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=445134.0, ans=0.125 2023-06-20 00:18:26,561 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-20 00:18:33,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=445194.0, ans=0.0 2023-06-20 00:18:50,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=445254.0, ans=0.125 2023-06-20 00:19:16,692 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0 2023-06-20 00:19:46,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=445374.0, ans=0.0 2023-06-20 00:19:56,221 INFO [train.py:996] (0/4) Epoch 3, batch 13250, loss[loss=0.2823, simple_loss=0.3563, pruned_loss=0.1042, over 21805.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3266, pruned_loss=0.09451, over 4290068.80 frames. ], batch size: 351, lr: 1.11e-02, grad_scale: 32.0 2023-06-20 00:19:59,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=445434.0, ans=0.125 2023-06-20 00:20:30,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=445494.0, ans=10.0 2023-06-20 00:20:52,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=445554.0, ans=0.1 2023-06-20 00:21:06,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-20 00:21:22,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=445614.0, ans=0.0 2023-06-20 00:21:33,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=445614.0, ans=0.0 2023-06-20 00:21:40,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.563e+02 2.914e+02 3.301e+02 4.888e+02, threshold=5.829e+02, percent-clipped=0.0 2023-06-20 00:22:12,924 INFO [train.py:996] (0/4) Epoch 3, batch 13300, loss[loss=0.2556, simple_loss=0.3272, pruned_loss=0.09201, over 21406.00 frames. 
], tot_loss[loss=0.2599, simple_loss=0.3305, pruned_loss=0.09469, over 4288038.98 frames. ], batch size: 194, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:24:11,635 INFO [train.py:996] (0/4) Epoch 3, batch 13350, loss[loss=0.3295, simple_loss=0.3983, pruned_loss=0.1303, over 21623.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3354, pruned_loss=0.09844, over 4291758.64 frames. ], batch size: 414, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:24:13,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=446034.0, ans=0.125 2023-06-20 00:24:42,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=446094.0, ans=0.0 2023-06-20 00:25:37,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=446214.0, ans=0.2 2023-06-20 00:25:50,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.814e+02 3.239e+02 3.845e+02 6.415e+02, threshold=6.478e+02, percent-clipped=2.0 2023-06-20 00:26:28,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=12.0 2023-06-20 00:26:29,092 INFO [train.py:996] (0/4) Epoch 3, batch 13400, loss[loss=0.2848, simple_loss=0.3472, pruned_loss=0.1112, over 21750.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3367, pruned_loss=0.1001, over 4292209.51 frames. ], batch size: 351, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:26:38,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=446334.0, ans=0.125 2023-06-20 00:27:48,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=446514.0, ans=0.015 2023-06-20 00:28:25,122 INFO [train.py:996] (0/4) Epoch 3, batch 13450, loss[loss=0.2254, simple_loss=0.2828, pruned_loss=0.084, over 21531.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3374, pruned_loss=0.1022, over 4286597.17 frames. ], batch size: 230, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:28:42,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=446634.0, ans=0.0 2023-06-20 00:29:04,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=446694.0, ans=0.0 2023-06-20 00:29:21,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=446754.0, ans=0.125 2023-06-20 00:29:31,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=446754.0, ans=0.125 2023-06-20 00:29:32,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.68 vs. 
limit=12.0 2023-06-20 00:29:33,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=446754.0, ans=0.125 2023-06-20 00:29:45,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=446814.0, ans=0.125 2023-06-20 00:29:56,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=446814.0, ans=0.125 2023-06-20 00:29:59,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.743e+02 3.193e+02 4.040e+02 8.828e+02, threshold=6.385e+02, percent-clipped=3.0 2023-06-20 00:30:34,586 INFO [train.py:996] (0/4) Epoch 3, batch 13500, loss[loss=0.3064, simple_loss=0.364, pruned_loss=0.1245, over 21509.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3281, pruned_loss=0.09903, over 4286681.39 frames. ], batch size: 473, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:31:17,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=446994.0, ans=0.125 2023-06-20 00:31:54,997 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:32:46,182 INFO [train.py:996] (0/4) Epoch 3, batch 13550, loss[loss=0.2626, simple_loss=0.3512, pruned_loss=0.08702, over 21229.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3337, pruned_loss=0.09836, over 4283335.35 frames. ], batch size: 159, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:33:04,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2023-06-20 00:33:34,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447294.0, ans=0.1 2023-06-20 00:33:48,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=447354.0, ans=0.5 2023-06-20 00:34:23,614 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.814e+02 3.268e+02 3.929e+02 6.587e+02, threshold=6.537e+02, percent-clipped=1.0 2023-06-20 00:34:59,314 INFO [train.py:996] (0/4) Epoch 3, batch 13600, loss[loss=0.2571, simple_loss=0.3161, pruned_loss=0.09908, over 21675.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3347, pruned_loss=0.09842, over 4284491.84 frames. ], batch size: 230, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:36:41,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=447774.0, ans=0.125 2023-06-20 00:36:43,523 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-20 00:36:56,422 INFO [train.py:996] (0/4) Epoch 3, batch 13650, loss[loss=0.2233, simple_loss=0.2873, pruned_loss=0.07963, over 21762.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3292, pruned_loss=0.0947, over 4286864.42 frames. ], batch size: 371, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:37:00,769 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. 
limit=15.0 2023-06-20 00:37:04,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=447834.0, ans=0.125 2023-06-20 00:37:15,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=447894.0, ans=0.0 2023-06-20 00:37:58,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=448014.0, ans=0.0 2023-06-20 00:38:03,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-20 00:38:04,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=448014.0, ans=0.1 2023-06-20 00:38:11,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.579e+02 2.975e+02 3.634e+02 4.818e+02, threshold=5.950e+02, percent-clipped=0.0 2023-06-20 00:38:43,055 INFO [train.py:996] (0/4) Epoch 3, batch 13700, loss[loss=0.2267, simple_loss=0.2919, pruned_loss=0.08076, over 21682.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3241, pruned_loss=0.09486, over 4288259.12 frames. ], batch size: 247, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:38:54,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=448134.0, ans=0.0 2023-06-20 00:40:30,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=448314.0, ans=0.125 2023-06-20 00:40:54,362 INFO [train.py:996] (0/4) Epoch 3, batch 13750, loss[loss=0.2337, simple_loss=0.3055, pruned_loss=0.08101, over 21692.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3217, pruned_loss=0.09326, over 4281474.61 frames. ], batch size: 351, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:41:34,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=448494.0, ans=0.125 2023-06-20 00:42:16,658 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=12.0 2023-06-20 00:42:35,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.755e+02 3.135e+02 3.702e+02 5.925e+02, threshold=6.269e+02, percent-clipped=0.0 2023-06-20 00:42:42,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=448674.0, ans=0.125 2023-06-20 00:42:59,110 INFO [train.py:996] (0/4) Epoch 3, batch 13800, loss[loss=0.2497, simple_loss=0.3494, pruned_loss=0.07501, over 21601.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3271, pruned_loss=0.09278, over 4274500.85 frames. ], batch size: 263, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:43:24,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.14 vs. 
limit=15.0 2023-06-20 00:43:26,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=448794.0, ans=0.125 2023-06-20 00:44:58,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=448974.0, ans=0.125 2023-06-20 00:45:01,401 INFO [train.py:996] (0/4) Epoch 3, batch 13850, loss[loss=0.283, simple_loss=0.3515, pruned_loss=0.1073, over 21447.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3327, pruned_loss=0.09391, over 4266113.91 frames. ], batch size: 194, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:45:03,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=449034.0, ans=0.0 2023-06-20 00:45:15,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=449034.0, ans=0.0 2023-06-20 00:45:18,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=449094.0, ans=0.125 2023-06-20 00:45:19,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=449094.0, ans=0.125 2023-06-20 00:45:26,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-20 00:45:27,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449094.0, ans=0.1 2023-06-20 00:46:09,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=449154.0, ans=0.2 2023-06-20 00:46:11,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=449154.0, ans=0.125 2023-06-20 00:46:40,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.873e+02 3.393e+02 4.011e+02 6.669e+02, threshold=6.786e+02, percent-clipped=1.0 2023-06-20 00:47:04,961 INFO [train.py:996] (0/4) Epoch 3, batch 13900, loss[loss=0.2757, simple_loss=0.3424, pruned_loss=0.1045, over 21377.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.337, pruned_loss=0.09819, over 4266902.96 frames. ], batch size: 159, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:47:05,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=12.0 2023-06-20 00:47:22,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=449394.0, ans=0.0 2023-06-20 00:47:52,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=449454.0, ans=0.05 2023-06-20 00:48:02,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=449454.0, ans=0.125 2023-06-20 00:48:30,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=449514.0, ans=0.125 2023-06-20 00:48:51,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=449574.0, ans=0.0 2023-06-20 00:48:56,883 INFO [train.py:996] (0/4) Epoch 3, batch 13950, loss[loss=0.258, simple_loss=0.3239, pruned_loss=0.09606, over 21850.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3369, pruned_loss=0.09992, over 4276113.79 frames. ], batch size: 332, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:49:16,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=449634.0, ans=0.125 2023-06-20 00:49:32,083 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:49:33,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.10 vs. limit=10.0 2023-06-20 00:49:51,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.54 vs. limit=6.0 2023-06-20 00:50:04,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=449754.0, ans=0.125 2023-06-20 00:50:25,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.39 vs. limit=10.0 2023-06-20 00:50:26,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=18.86 vs. limit=15.0 2023-06-20 00:50:39,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.609e+02 3.095e+02 3.648e+02 5.597e+02, threshold=6.190e+02, percent-clipped=0.0 2023-06-20 00:50:57,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=449874.0, ans=0.125 2023-06-20 00:51:07,770 INFO [train.py:996] (0/4) Epoch 3, batch 14000, loss[loss=0.2151, simple_loss=0.2995, pruned_loss=0.0654, over 21775.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3325, pruned_loss=0.09752, over 4269977.76 frames. ], batch size: 298, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:52:18,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=450114.0, ans=0.0 2023-06-20 00:52:23,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=450114.0, ans=0.125 2023-06-20 00:52:49,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. 
limit=15.0 2023-06-20 00:52:56,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=450174.0, ans=0.125 2023-06-20 00:52:56,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=450174.0, ans=0.125 2023-06-20 00:53:00,095 INFO [train.py:996] (0/4) Epoch 3, batch 14050, loss[loss=0.2258, simple_loss=0.2906, pruned_loss=0.08046, over 21177.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3264, pruned_loss=0.09271, over 4278733.15 frames. ], batch size: 548, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:54:06,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=450354.0, ans=0.125 2023-06-20 00:54:41,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 2.482e+02 3.063e+02 3.822e+02 8.036e+02, threshold=6.126e+02, percent-clipped=3.0 2023-06-20 00:55:01,138 INFO [train.py:996] (0/4) Epoch 3, batch 14100, loss[loss=0.2271, simple_loss=0.2694, pruned_loss=0.09239, over 20230.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3214, pruned_loss=0.09248, over 4266225.96 frames. ], batch size: 703, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:56:25,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=450714.0, ans=0.0 2023-06-20 00:56:38,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=450774.0, ans=0.1 2023-06-20 00:56:50,639 INFO [train.py:996] (0/4) Epoch 3, batch 14150, loss[loss=0.2495, simple_loss=0.3227, pruned_loss=0.08816, over 21856.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3252, pruned_loss=0.09376, over 4243589.96 frames. ], batch size: 98, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 00:57:04,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=450894.0, ans=0.0 2023-06-20 00:57:11,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=450894.0, ans=0.2 2023-06-20 00:57:13,893 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-20 00:57:58,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.371e+02 2.924e+02 3.889e+02 6.817e+02, threshold=5.848e+02, percent-clipped=2.0 2023-06-20 00:58:20,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-20 00:58:20,974 INFO [train.py:996] (0/4) Epoch 3, batch 14200, loss[loss=0.2829, simple_loss=0.3235, pruned_loss=0.1211, over 21372.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3228, pruned_loss=0.09176, over 4249048.92 frames. ], batch size: 471, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 00:58:31,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. 
limit=15.0 2023-06-20 00:58:32,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=451134.0, ans=0.125 2023-06-20 00:58:54,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=451254.0, ans=0.125 2023-06-20 00:59:57,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=451374.0, ans=0.0 2023-06-20 01:00:03,131 INFO [train.py:996] (0/4) Epoch 3, batch 14250, loss[loss=0.2195, simple_loss=0.307, pruned_loss=0.06602, over 21711.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3175, pruned_loss=0.09056, over 4240170.76 frames. ], batch size: 247, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:00:10,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=451434.0, ans=0.0 2023-06-20 01:01:36,130 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.661e+02 3.144e+02 3.972e+02 7.330e+02, threshold=6.288e+02, percent-clipped=4.0 2023-06-20 01:01:39,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=451674.0, ans=0.1 2023-06-20 01:02:01,271 INFO [train.py:996] (0/4) Epoch 3, batch 14300, loss[loss=0.3819, simple_loss=0.4557, pruned_loss=0.1541, over 21636.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3194, pruned_loss=0.09039, over 4241754.74 frames. ], batch size: 414, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:02:53,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=451794.0, ans=0.125 2023-06-20 01:02:53,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=451794.0, ans=0.04949747468305833 2023-06-20 01:03:12,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451854.0, ans=0.1 2023-06-20 01:03:57,046 INFO [train.py:996] (0/4) Epoch 3, batch 14350, loss[loss=0.2594, simple_loss=0.3327, pruned_loss=0.09307, over 21717.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.324, pruned_loss=0.09112, over 4248199.28 frames. ], batch size: 389, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:05:22,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=452214.0, ans=0.125 2023-06-20 01:05:27,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.766e+02 3.278e+02 4.344e+02 6.485e+02, threshold=6.555e+02, percent-clipped=1.0 2023-06-20 01:05:49,278 INFO [train.py:996] (0/4) Epoch 3, batch 14400, loss[loss=0.3113, simple_loss=0.348, pruned_loss=0.1373, over 21770.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3246, pruned_loss=0.09266, over 4257557.59 frames. ], batch size: 508, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 01:07:48,690 INFO [train.py:996] (0/4) Epoch 3, batch 14450, loss[loss=0.2129, simple_loss=0.251, pruned_loss=0.08739, over 20834.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3201, pruned_loss=0.09305, over 4254123.34 frames. 
], batch size: 608, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 01:08:05,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=452694.0, ans=0.2 2023-06-20 01:09:02,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.591e+02 2.982e+02 3.373e+02 5.553e+02, threshold=5.963e+02, percent-clipped=0.0 2023-06-20 01:09:29,464 INFO [train.py:996] (0/4) Epoch 3, batch 14500, loss[loss=0.2456, simple_loss=0.3092, pruned_loss=0.09096, over 21428.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3163, pruned_loss=0.09179, over 4252912.06 frames. ], batch size: 131, lr: 1.10e-02, grad_scale: 32.0 2023-06-20 01:09:55,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-20 01:10:03,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-20 01:11:21,638 INFO [train.py:996] (0/4) Epoch 3, batch 14550, loss[loss=0.3097, simple_loss=0.3719, pruned_loss=0.1238, over 21343.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3215, pruned_loss=0.0936, over 4262408.13 frames. ], batch size: 159, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:12:19,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-20 01:13:07,566 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 2.890e+02 3.229e+02 4.188e+02 6.870e+02, threshold=6.458e+02, percent-clipped=5.0 2023-06-20 01:13:39,688 INFO [train.py:996] (0/4) Epoch 3, batch 14600, loss[loss=0.2451, simple_loss=0.2895, pruned_loss=0.1003, over 20152.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3296, pruned_loss=0.09836, over 4269735.89 frames. ], batch size: 703, lr: 1.10e-02, grad_scale: 16.0 2023-06-20 01:14:14,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=453594.0, ans=0.125 2023-06-20 01:14:40,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=453654.0, ans=0.125 2023-06-20 01:14:43,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=453654.0, ans=0.0 2023-06-20 01:14:53,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=453714.0, ans=0.2 2023-06-20 01:14:56,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=453714.0, ans=0.0 2023-06-20 01:15:27,602 INFO [train.py:996] (0/4) Epoch 3, batch 14650, loss[loss=0.2147, simple_loss=0.262, pruned_loss=0.0837, over 20986.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3323, pruned_loss=0.09866, over 4263018.08 frames. 
], batch size: 608, lr: 1.09e-02, grad_scale: 16.0 2023-06-20 01:15:57,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=453894.0, ans=0.125 2023-06-20 01:17:12,668 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 2.350e+02 2.833e+02 3.404e+02 5.520e+02, threshold=5.666e+02, percent-clipped=0.0 2023-06-20 01:17:14,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=454074.0, ans=0.0 2023-06-20 01:17:27,806 INFO [train.py:996] (0/4) Epoch 3, batch 14700, loss[loss=0.2222, simple_loss=0.3152, pruned_loss=0.06455, over 21738.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3254, pruned_loss=0.09165, over 4268127.58 frames. ], batch size: 298, lr: 1.09e-02, grad_scale: 16.0 2023-06-20 01:17:52,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=454194.0, ans=0.125 2023-06-20 01:18:47,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=454314.0, ans=0.125 2023-06-20 01:19:05,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=454374.0, ans=0.125 2023-06-20 01:19:29,825 INFO [train.py:996] (0/4) Epoch 3, batch 14750, loss[loss=0.3566, simple_loss=0.4351, pruned_loss=0.139, over 21245.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3291, pruned_loss=0.09344, over 4273055.31 frames. ], batch size: 548, lr: 1.09e-02, grad_scale: 16.0 2023-06-20 01:21:21,289 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.805e+02 3.277e+02 4.115e+02 8.120e+02, threshold=6.554e+02, percent-clipped=5.0 2023-06-20 01:21:28,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=454674.0, ans=0.125 2023-06-20 01:21:52,219 INFO [train.py:996] (0/4) Epoch 3, batch 14800, loss[loss=0.2892, simple_loss=0.3636, pruned_loss=0.1074, over 21698.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3405, pruned_loss=0.1, over 4266133.46 frames. ], batch size: 282, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:22:16,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=454734.0, ans=0.125 2023-06-20 01:22:51,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-20 01:22:54,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-20 01:23:06,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=454914.0, ans=0.125 2023-06-20 01:23:48,739 INFO [train.py:996] (0/4) Epoch 3, batch 14850, loss[loss=0.2271, simple_loss=0.2839, pruned_loss=0.0852, over 21532.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3332, pruned_loss=0.09918, over 4271532.95 frames. 
], batch size: 263, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:24:12,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=455094.0, ans=0.2 2023-06-20 01:24:22,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=455094.0, ans=0.125 2023-06-20 01:24:32,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=455094.0, ans=0.1 2023-06-20 01:24:57,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. limit=10.0 2023-06-20 01:25:32,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=455274.0, ans=0.2 2023-06-20 01:25:33,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 2.933e+02 3.352e+02 4.211e+02 7.293e+02, threshold=6.704e+02, percent-clipped=2.0 2023-06-20 01:25:58,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.96 vs. limit=6.0 2023-06-20 01:26:01,553 INFO [train.py:996] (0/4) Epoch 3, batch 14900, loss[loss=0.2817, simple_loss=0.3468, pruned_loss=0.1083, over 21671.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3343, pruned_loss=0.09991, over 4272607.63 frames. ], batch size: 351, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:26:19,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=455334.0, ans=0.125 2023-06-20 01:27:27,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-06-20 01:28:13,436 INFO [train.py:996] (0/4) Epoch 3, batch 14950, loss[loss=0.2609, simple_loss=0.3339, pruned_loss=0.09389, over 21629.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.336, pruned_loss=0.1002, over 4265009.09 frames. ], batch size: 230, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:28:15,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=455634.0, ans=0.0 2023-06-20 01:28:53,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455694.0, ans=0.1 2023-06-20 01:29:33,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=455814.0, ans=0.125 2023-06-20 01:29:51,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455874.0, ans=0.1 2023-06-20 01:29:52,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.902e+02 3.334e+02 3.960e+02 6.522e+02, threshold=6.669e+02, percent-clipped=0.0 2023-06-20 01:30:13,464 INFO [train.py:996] (0/4) Epoch 3, batch 15000, loss[loss=0.2508, simple_loss=0.3147, pruned_loss=0.09341, over 21808.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3391, pruned_loss=0.1023, over 4268426.45 frames. ], batch size: 282, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:30:13,465 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 01:31:04,542 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2678, simple_loss=0.368, pruned_loss=0.08383, over 1796401.00 frames. 
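
The optim.py "Clipping_scale" lines throughout this log follow one fixed pattern: the five numbers after "grad-norm quartiles" are the minimum, 25th percentile, median, 75th percentile and maximum of recently observed gradient norms, and every logged threshold equals clipping_scale (2.0 here) times the logged median, up to display rounding (e.g. 2.0 * 3.063e+02 = 6.126e+02 in the 00:54:41 line). Below is a minimal Python sketch of that bookkeeping, not the actual optim.py code; it assumes percent-clipped is the share of a recent window of steps whose norm exceeded the threshold, with a window size that is not recoverable from the log:

import numpy as np

def summarize_clipping(recent_grad_norms, clipping_scale=2.0):
    # Five-number summary in the order the log prints it:
    # min, 25%, median, 75%, max of recent gradient norms.
    qs = np.quantile(np.asarray(recent_grad_norms), [0.0, 0.25, 0.5, 0.75, 1.0])
    threshold = clipping_scale * qs[2]  # threshold = 2.0 x median, as logged
    percent_clipped = 100.0 * float(np.mean(np.asarray(recent_grad_norms) > threshold))
    return qs, threshold, percent_clipped

Read the quartiles against the threshold as a quick health check: a maximum well above the median together with a small percent-clipped (most lines here show 0.0 to 5.0) means only rare outlier batches are being clipped, not the typical update.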
2023-06-20 01:31:04,543 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-20 01:31:05,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=455934.0, ans=0.0 2023-06-20 01:31:20,053 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-76000.pt 2023-06-20 01:31:24,305 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.30 vs. limit=6.0 2023-06-20 01:31:55,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=456054.0, ans=0.125 2023-06-20 01:32:23,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=456114.0, ans=0.2 2023-06-20 01:32:32,169 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-20 01:32:46,004 INFO [train.py:996] (0/4) Epoch 3, batch 15050, loss[loss=0.275, simple_loss=0.3561, pruned_loss=0.09698, over 21729.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3404, pruned_loss=0.103, over 4275504.32 frames. ], batch size: 298, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:33:23,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=456294.0, ans=0.125 2023-06-20 01:33:48,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.05 vs. limit=15.0 2023-06-20 01:34:12,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.015e+02 3.596e+02 4.330e+02 8.348e+02, threshold=7.192e+02, percent-clipped=9.0 2023-06-20 01:34:48,983 INFO [train.py:996] (0/4) Epoch 3, batch 15100, loss[loss=0.2863, simple_loss=0.3518, pruned_loss=0.1104, over 21719.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3435, pruned_loss=0.1029, over 4275403.10 frames. ], batch size: 332, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:35:58,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=456654.0, ans=0.125 2023-06-20 01:36:34,668 INFO [train.py:996] (0/4) Epoch 3, batch 15150, loss[loss=0.2809, simple_loss=0.3189, pruned_loss=0.1214, over 21244.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3416, pruned_loss=0.1036, over 4268564.42 frames. ], batch size: 471, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:36:53,385 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-20 01:36:55,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=456894.0, ans=0.0 2023-06-20 01:37:30,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=456954.0, ans=0.125 2023-06-20 01:37:31,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.20 vs. 
limit=15.0 2023-06-20 01:37:50,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.762e+02 3.255e+02 4.111e+02 7.201e+02, threshold=6.510e+02, percent-clipped=1.0 2023-06-20 01:38:05,365 INFO [train.py:996] (0/4) Epoch 3, batch 15200, loss[loss=0.1968, simple_loss=0.2615, pruned_loss=0.06608, over 21746.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3306, pruned_loss=0.09906, over 4270723.29 frames. ], batch size: 112, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:38:52,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=457194.0, ans=0.125 2023-06-20 01:39:36,701 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:40:23,930 INFO [train.py:996] (0/4) Epoch 3, batch 15250, loss[loss=0.2607, simple_loss=0.3208, pruned_loss=0.1003, over 15070.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3242, pruned_loss=0.09648, over 4254270.63 frames. ], batch size: 61, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:40:52,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457494.0, ans=0.1 2023-06-20 01:41:27,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=457614.0, ans=0.2 2023-06-20 01:41:29,912 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-20 01:41:30,755 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:41:50,049 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.712e+02 3.294e+02 3.861e+02 5.748e+02, threshold=6.588e+02, percent-clipped=0.0 2023-06-20 01:42:02,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457674.0, ans=0.1 2023-06-20 01:42:17,466 INFO [train.py:996] (0/4) Epoch 3, batch 15300, loss[loss=0.3272, simple_loss=0.3685, pruned_loss=0.143, over 21434.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3277, pruned_loss=0.09978, over 4253363.63 frames. ], batch size: 471, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:43:21,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=457854.0, ans=0.125 2023-06-20 01:43:28,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=457914.0, ans=0.125 2023-06-20 01:43:41,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=457974.0, ans=0.0 2023-06-20 01:44:20,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=458034.0, ans=0.125 2023-06-20 01:44:21,161 INFO [train.py:996] (0/4) Epoch 3, batch 15350, loss[loss=0.3403, simple_loss=0.4007, pruned_loss=0.1399, over 21394.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3339, pruned_loss=0.103, over 4261226.73 frames. ], batch size: 507, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:45:08,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.70 vs. 
limit=15.0 2023-06-20 01:45:18,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=458214.0, ans=0.0 2023-06-20 01:45:31,291 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.699e+02 3.240e+02 4.013e+02 6.691e+02, threshold=6.480e+02, percent-clipped=1.0 2023-06-20 01:45:42,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=458274.0, ans=0.5 2023-06-20 01:45:46,122 INFO [train.py:996] (0/4) Epoch 3, batch 15400, loss[loss=0.2247, simple_loss=0.267, pruned_loss=0.0912, over 20710.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.332, pruned_loss=0.09982, over 4255404.10 frames. ], batch size: 609, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:45:46,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=458334.0, ans=0.2 2023-06-20 01:45:50,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=458334.0, ans=0.0 2023-06-20 01:45:52,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=458334.0, ans=0.035 2023-06-20 01:46:19,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=458394.0, ans=0.0 2023-06-20 01:47:07,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458574.0, ans=0.1 2023-06-20 01:47:33,358 INFO [train.py:996] (0/4) Epoch 3, batch 15450, loss[loss=0.28, simple_loss=0.373, pruned_loss=0.09351, over 19738.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3296, pruned_loss=0.09906, over 4260896.56 frames. ], batch size: 703, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:48:40,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=458814.0, ans=0.2 2023-06-20 01:48:41,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=458814.0, ans=0.125 2023-06-20 01:48:54,289 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.616e+02 3.368e+02 4.050e+02 6.110e+02, threshold=6.736e+02, percent-clipped=0.0 2023-06-20 01:49:00,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=458874.0, ans=0.125 2023-06-20 01:49:32,917 INFO [train.py:996] (0/4) Epoch 3, batch 15500, loss[loss=0.2891, simple_loss=0.3528, pruned_loss=0.1127, over 21830.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3337, pruned_loss=0.09921, over 4262266.22 frames. ], batch size: 282, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:50:03,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-20 01:51:05,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=459114.0, ans=0.125 2023-06-20 01:51:26,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=459174.0, ans=0.125 2023-06-20 01:51:48,344 INFO [train.py:996] (0/4) Epoch 3, batch 15550, loss[loss=0.2438, simple_loss=0.3172, pruned_loss=0.08514, over 21798.00 frames. 
], tot_loss[loss=0.2632, simple_loss=0.3331, pruned_loss=0.09668, over 4264761.30 frames. ], batch size: 371, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:52:14,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=459294.0, ans=0.125 2023-06-20 01:52:28,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=459354.0, ans=0.125 2023-06-20 01:52:28,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=459354.0, ans=0.125 2023-06-20 01:52:34,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=459354.0, ans=0.125 2023-06-20 01:53:14,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.62 vs. limit=22.5 2023-06-20 01:53:14,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.749e+02 2.425e+02 2.682e+02 3.354e+02 5.431e+02, threshold=5.364e+02, percent-clipped=0.0 2023-06-20 01:53:26,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=459474.0, ans=0.0 2023-06-20 01:53:31,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=459474.0, ans=0.0 2023-06-20 01:53:35,285 INFO [train.py:996] (0/4) Epoch 3, batch 15600, loss[loss=0.2347, simple_loss=0.3053, pruned_loss=0.08203, over 21724.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3268, pruned_loss=0.09432, over 4269386.57 frames. ], batch size: 351, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:53:47,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=459534.0, ans=0.125 2023-06-20 01:53:59,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=459594.0, ans=0.125 2023-06-20 01:55:06,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459774.0, ans=0.1 2023-06-20 01:55:34,237 INFO [train.py:996] (0/4) Epoch 3, batch 15650, loss[loss=0.2418, simple_loss=0.3038, pruned_loss=0.08995, over 21324.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3259, pruned_loss=0.09385, over 4269564.74 frames. ], batch size: 160, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:55:53,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=459834.0, ans=10.0 2023-06-20 01:56:03,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459894.0, ans=0.1 2023-06-20 01:56:08,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=15.0 2023-06-20 01:56:09,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.03 vs. 
limit=22.5 2023-06-20 01:56:27,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=459954.0, ans=0.04949747468305833 2023-06-20 01:56:30,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=459954.0, ans=0.125 2023-06-20 01:56:57,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=460074.0, ans=0.125 2023-06-20 01:56:57,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.476e+02 2.884e+02 3.400e+02 5.919e+02, threshold=5.768e+02, percent-clipped=1.0 2023-06-20 01:57:09,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=460074.0, ans=0.0 2023-06-20 01:57:17,714 INFO [train.py:996] (0/4) Epoch 3, batch 15700, loss[loss=0.2203, simple_loss=0.2797, pruned_loss=0.08047, over 21621.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3211, pruned_loss=0.09264, over 4267721.39 frames. ], batch size: 298, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:58:18,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=460254.0, ans=0.125 2023-06-20 01:58:23,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=460254.0, ans=0.0 2023-06-20 01:58:26,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=22.5 2023-06-20 01:58:55,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=460374.0, ans=0.125 2023-06-20 01:59:18,031 INFO [train.py:996] (0/4) Epoch 3, batch 15750, loss[loss=0.253, simple_loss=0.3167, pruned_loss=0.09458, over 21601.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3164, pruned_loss=0.09224, over 4269230.64 frames. ], batch size: 247, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 01:59:32,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=460494.0, ans=0.0 2023-06-20 01:59:58,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=460554.0, ans=0.0 2023-06-20 02:00:50,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=460614.0, ans=0.0 2023-06-20 02:00:54,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.350e+02 2.605e+02 2.992e+02 4.357e+02, threshold=5.211e+02, percent-clipped=0.0 2023-06-20 02:01:21,482 INFO [train.py:996] (0/4) Epoch 3, batch 15800, loss[loss=0.2394, simple_loss=0.2986, pruned_loss=0.09016, over 21663.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3125, pruned_loss=0.09211, over 4273975.85 frames. 
], batch size: 332, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:02:05,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460854.0, ans=0.1 2023-06-20 02:02:26,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=460914.0, ans=0.125 2023-06-20 02:02:41,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=460914.0, ans=0.125 2023-06-20 02:03:00,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=460974.0, ans=0.0 2023-06-20 02:03:13,423 INFO [train.py:996] (0/4) Epoch 3, batch 15850, loss[loss=0.2532, simple_loss=0.3182, pruned_loss=0.09409, over 21935.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3153, pruned_loss=0.09481, over 4261467.43 frames. ], batch size: 317, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:03:22,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=22.5 2023-06-20 02:03:39,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=461094.0, ans=0.1 2023-06-20 02:03:52,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=461154.0, ans=0.0 2023-06-20 02:04:02,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=461154.0, ans=0.025 2023-06-20 02:04:04,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=461214.0, ans=0.125 2023-06-20 02:04:15,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=461214.0, ans=0.125 2023-06-20 02:04:17,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=461214.0, ans=0.0 2023-06-20 02:04:33,409 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.753e+02 3.333e+02 4.254e+02 6.443e+02, threshold=6.666e+02, percent-clipped=4.0 2023-06-20 02:04:50,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=8.0 2023-06-20 02:04:59,060 INFO [train.py:996] (0/4) Epoch 3, batch 15900, loss[loss=0.2384, simple_loss=0.2866, pruned_loss=0.09509, over 21514.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3121, pruned_loss=0.09448, over 4272130.16 frames. ], batch size: 212, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:05:01,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=461334.0, ans=0.2 2023-06-20 02:05:02,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=461334.0, ans=0.2 2023-06-20 02:05:10,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-20 02:05:27,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.18 vs. 
limit=15.0 2023-06-20 02:05:43,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=461454.0, ans=0.2 2023-06-20 02:06:30,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=461574.0, ans=0.0 2023-06-20 02:06:52,742 INFO [train.py:996] (0/4) Epoch 3, batch 15950, loss[loss=0.2129, simple_loss=0.2961, pruned_loss=0.06481, over 21212.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3126, pruned_loss=0.09104, over 4259018.77 frames. ], batch size: 159, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:07:27,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=461694.0, ans=0.1 2023-06-20 02:08:25,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=461814.0, ans=0.0 2023-06-20 02:08:25,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=461814.0, ans=0.125 2023-06-20 02:08:30,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 2.370e+02 2.785e+02 3.587e+02 6.147e+02, threshold=5.571e+02, percent-clipped=0.0 2023-06-20 02:08:37,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=461874.0, ans=0.125 2023-06-20 02:08:51,762 INFO [train.py:996] (0/4) Epoch 3, batch 16000, loss[loss=0.2347, simple_loss=0.3279, pruned_loss=0.07071, over 21788.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3147, pruned_loss=0.0896, over 4262694.43 frames. ], batch size: 351, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:09:07,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=461934.0, ans=0.125 2023-06-20 02:09:15,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461994.0, ans=0.1 2023-06-20 02:09:18,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=461994.0, ans=0.0 2023-06-20 02:10:38,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=15.0 2023-06-20 02:11:01,270 INFO [train.py:996] (0/4) Epoch 3, batch 16050, loss[loss=0.2461, simple_loss=0.3268, pruned_loss=0.08274, over 21422.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3181, pruned_loss=0.08762, over 4266711.10 frames. ], batch size: 194, lr: 1.09e-02, grad_scale: 32.0 2023-06-20 02:11:39,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. 
limit=15.0 2023-06-20 02:12:28,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.489e+02 3.002e+02 4.044e+02 8.008e+02, threshold=6.003e+02, percent-clipped=4.0 2023-06-20 02:12:39,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=462474.0, ans=0.1 2023-06-20 02:12:39,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462474.0, ans=0.1 2023-06-20 02:12:50,215 INFO [train.py:996] (0/4) Epoch 3, batch 16100, loss[loss=0.2614, simple_loss=0.3225, pruned_loss=0.1002, over 21874.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3227, pruned_loss=0.08957, over 4278809.40 frames. ], batch size: 351, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:13:30,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=462594.0, ans=0.125 2023-06-20 02:14:49,953 INFO [train.py:996] (0/4) Epoch 3, batch 16150, loss[loss=0.2683, simple_loss=0.3203, pruned_loss=0.1081, over 21315.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3245, pruned_loss=0.09204, over 4283447.19 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:15:01,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=462834.0, ans=0.125 2023-06-20 02:15:29,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=462954.0, ans=0.125 2023-06-20 02:16:17,570 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.540e+02 2.925e+02 3.494e+02 4.930e+02, threshold=5.850e+02, percent-clipped=0.0 2023-06-20 02:16:47,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-20 02:16:48,932 INFO [train.py:996] (0/4) Epoch 3, batch 16200, loss[loss=0.2285, simple_loss=0.3097, pruned_loss=0.07368, over 21629.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.328, pruned_loss=0.09265, over 4283211.33 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:16:56,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=463134.0, ans=0.125 2023-06-20 02:17:08,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=463194.0, ans=0.2 2023-06-20 02:17:14,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=463194.0, ans=0.125 2023-06-20 02:17:30,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=463254.0, ans=0.125 2023-06-20 02:17:35,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=463254.0, ans=10.0 2023-06-20 02:17:51,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463314.0, ans=0.1 2023-06-20 02:17:52,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.53 vs. 
limit=15.0 2023-06-20 02:18:21,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=463374.0, ans=0.125 2023-06-20 02:18:22,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=463374.0, ans=0.125 2023-06-20 02:18:39,601 INFO [train.py:996] (0/4) Epoch 3, batch 16250, loss[loss=0.2142, simple_loss=0.2843, pruned_loss=0.07209, over 21663.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3276, pruned_loss=0.09308, over 4277472.78 frames. ], batch size: 298, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:18:53,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.70 vs. limit=22.5 2023-06-20 02:18:58,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=463434.0, ans=0.125 2023-06-20 02:20:10,411 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.534e+02 3.093e+02 3.878e+02 6.087e+02, threshold=6.186e+02, percent-clipped=1.0 2023-06-20 02:20:29,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=463734.0, ans=0.0 2023-06-20 02:20:30,344 INFO [train.py:996] (0/4) Epoch 3, batch 16300, loss[loss=0.2044, simple_loss=0.2724, pruned_loss=0.06814, over 21280.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3197, pruned_loss=0.08865, over 4264875.44 frames. ], batch size: 176, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:20:49,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=463794.0, ans=0.125 2023-06-20 02:21:00,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=463794.0, ans=0.125 2023-06-20 02:21:18,590 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:21:58,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=463914.0, ans=0.0 2023-06-20 02:22:17,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=463974.0, ans=0.2 2023-06-20 02:22:19,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=463974.0, ans=0.125 2023-06-20 02:22:27,464 INFO [train.py:996] (0/4) Epoch 3, batch 16350, loss[loss=0.3149, simple_loss=0.384, pruned_loss=0.1229, over 21775.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3201, pruned_loss=0.09014, over 4263861.13 frames. 
], batch size: 118, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:22:29,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=464034.0, ans=0.125 2023-06-20 02:23:45,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=464154.0, ans=0.05 2023-06-20 02:23:58,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=464214.0, ans=0.125 2023-06-20 02:24:13,372 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.570e+02 3.121e+02 3.955e+02 6.816e+02, threshold=6.242e+02, percent-clipped=2.0 2023-06-20 02:24:40,017 INFO [train.py:996] (0/4) Epoch 3, batch 16400, loss[loss=0.256, simple_loss=0.3414, pruned_loss=0.08534, over 21331.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3247, pruned_loss=0.09249, over 4261739.30 frames. ], batch size: 548, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:25:31,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=464454.0, ans=0.2 2023-06-20 02:26:46,839 INFO [train.py:996] (0/4) Epoch 3, batch 16450, loss[loss=0.2828, simple_loss=0.3391, pruned_loss=0.1133, over 20707.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3238, pruned_loss=0.09297, over 4268339.24 frames. ], batch size: 607, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:26:50,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=464634.0, ans=0.0 2023-06-20 02:27:16,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=464694.0, ans=0.2 2023-06-20 02:27:19,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=464694.0, ans=0.07 2023-06-20 02:27:40,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=464754.0, ans=0.125 2023-06-20 02:28:32,307 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.737e+02 3.058e+02 3.975e+02 7.376e+02, threshold=6.116e+02, percent-clipped=5.0 2023-06-20 02:28:34,806 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-20 02:28:52,848 INFO [train.py:996] (0/4) Epoch 3, batch 16500, loss[loss=0.2338, simple_loss=0.3075, pruned_loss=0.08008, over 21788.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3241, pruned_loss=0.09349, over 4269044.79 frames. ], batch size: 332, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:29:17,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=464994.0, ans=0.125 2023-06-20 02:30:31,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=465114.0, ans=0.125 2023-06-20 02:30:51,346 INFO [train.py:996] (0/4) Epoch 3, batch 16550, loss[loss=0.246, simple_loss=0.3381, pruned_loss=0.07699, over 20890.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3195, pruned_loss=0.0902, over 4270699.75 frames. 
], batch size: 608, lr: 1.08e-02, grad_scale: 64.0 2023-06-20 02:31:46,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=465294.0, ans=0.125 2023-06-20 02:31:50,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=465354.0, ans=0.0 2023-06-20 02:31:51,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=22.5 2023-06-20 02:32:38,360 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.553e+02 2.978e+02 3.771e+02 6.936e+02, threshold=5.956e+02, percent-clipped=3.0 2023-06-20 02:33:16,491 INFO [train.py:996] (0/4) Epoch 3, batch 16600, loss[loss=0.2362, simple_loss=0.3352, pruned_loss=0.06858, over 20782.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3288, pruned_loss=0.09379, over 4273656.87 frames. ], batch size: 608, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:34:53,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-20 02:35:24,719 INFO [train.py:996] (0/4) Epoch 3, batch 16650, loss[loss=0.2941, simple_loss=0.3666, pruned_loss=0.1108, over 21546.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3403, pruned_loss=0.0966, over 4271431.98 frames. ], batch size: 230, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:36:51,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=466014.0, ans=0.0 2023-06-20 02:37:14,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.203e+02 2.861e+02 3.349e+02 3.962e+02 7.489e+02, threshold=6.698e+02, percent-clipped=1.0 2023-06-20 02:37:29,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=466074.0, ans=0.125 2023-06-20 02:37:33,543 INFO [train.py:996] (0/4) Epoch 3, batch 16700, loss[loss=0.1905, simple_loss=0.2533, pruned_loss=0.0639, over 21879.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3401, pruned_loss=0.09725, over 4267284.93 frames. ], batch size: 98, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:38:21,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=15.0 2023-06-20 02:38:22,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=466254.0, ans=0.125 2023-06-20 02:39:00,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=466314.0, ans=0.125 2023-06-20 02:39:02,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=466314.0, ans=0.125 2023-06-20 02:39:35,645 INFO [train.py:996] (0/4) Epoch 3, batch 16750, loss[loss=0.2037, simple_loss=0.2564, pruned_loss=0.07556, over 21683.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3425, pruned_loss=0.1002, over 4271046.16 frames. 
], batch size: 112, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:40:29,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=466494.0, ans=0.0 2023-06-20 02:40:46,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-20 02:41:18,238 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.66 vs. limit=12.0 2023-06-20 02:41:41,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=466674.0, ans=0.0 2023-06-20 02:41:44,390 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.879e+02 3.240e+02 3.697e+02 5.648e+02, threshold=6.479e+02, percent-clipped=0.0 2023-06-20 02:41:44,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=466674.0, ans=0.025 2023-06-20 02:41:57,465 INFO [train.py:996] (0/4) Epoch 3, batch 16800, loss[loss=0.2399, simple_loss=0.3001, pruned_loss=0.0898, over 21303.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3479, pruned_loss=0.1006, over 4263874.78 frames. ], batch size: 176, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:41:59,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466734.0, ans=0.1 2023-06-20 02:41:59,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=466734.0, ans=0.04949747468305833 2023-06-20 02:42:29,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-20 02:42:44,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=466794.0, ans=0.1 2023-06-20 02:42:58,393 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:44:01,335 INFO [train.py:996] (0/4) Epoch 3, batch 16850, loss[loss=0.2587, simple_loss=0.3172, pruned_loss=0.1001, over 21894.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3441, pruned_loss=0.1005, over 4265826.74 frames. ], batch size: 371, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:44:36,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.17 vs. limit=6.0 2023-06-20 02:45:22,549 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.543e+02 2.978e+02 3.751e+02 8.288e+02, threshold=5.956e+02, percent-clipped=4.0 2023-06-20 02:45:37,906 INFO [train.py:996] (0/4) Epoch 3, batch 16900, loss[loss=0.2367, simple_loss=0.3023, pruned_loss=0.08558, over 21628.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3384, pruned_loss=0.09862, over 4272297.87 frames. 
], batch size: 414, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:45:44,497 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:46:02,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=467394.0, ans=0.1 2023-06-20 02:46:09,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=467394.0, ans=0.125 2023-06-20 02:46:26,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=467454.0, ans=0.0 2023-06-20 02:46:32,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=467454.0, ans=0.2 2023-06-20 02:46:49,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=467514.0, ans=0.125 2023-06-20 02:47:13,024 INFO [train.py:996] (0/4) Epoch 3, batch 16950, loss[loss=0.2346, simple_loss=0.2896, pruned_loss=0.08977, over 21161.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.329, pruned_loss=0.09594, over 4269025.37 frames. ], batch size: 608, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:47:16,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-20 02:47:25,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=467634.0, ans=15.0 2023-06-20 02:47:51,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=467694.0, ans=0.1 2023-06-20 02:48:23,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=467754.0, ans=0.125 2023-06-20 02:48:26,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=467754.0, ans=0.0 2023-06-20 02:48:54,929 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.520e+02 2.812e+02 3.324e+02 6.057e+02, threshold=5.623e+02, percent-clipped=0.0 2023-06-20 02:49:08,250 INFO [train.py:996] (0/4) Epoch 3, batch 17000, loss[loss=0.2535, simple_loss=0.3158, pruned_loss=0.09563, over 21712.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3265, pruned_loss=0.09659, over 4278772.10 frames. ], batch size: 230, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:49:38,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=467934.0, ans=0.125 2023-06-20 02:49:59,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=467994.0, ans=0.0 2023-06-20 02:50:00,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=467994.0, ans=0.2 2023-06-20 02:50:54,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=468174.0, ans=0.125 2023-06-20 02:50:59,625 INFO [train.py:996] (0/4) Epoch 3, batch 17050, loss[loss=0.2752, simple_loss=0.3477, pruned_loss=0.1013, over 21169.00 frames. 
], tot_loss[loss=0.2672, simple_loss=0.3347, pruned_loss=0.09986, over 4288103.52 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:51:03,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=468234.0, ans=0.125 2023-06-20 02:51:18,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=468234.0, ans=0.125 2023-06-20 02:51:21,881 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.36 vs. limit=6.0 2023-06-20 02:51:28,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-20 02:51:37,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=468294.0, ans=0.2 2023-06-20 02:51:51,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=468354.0, ans=0.125 2023-06-20 02:52:20,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=468474.0, ans=0.125 2023-06-20 02:52:22,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.663e+02 3.089e+02 3.979e+02 5.737e+02, threshold=6.177e+02, percent-clipped=2.0 2023-06-20 02:52:33,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=468474.0, ans=0.125 2023-06-20 02:52:33,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=468474.0, ans=0.0 2023-06-20 02:52:35,777 INFO [train.py:996] (0/4) Epoch 3, batch 17100, loss[loss=0.2608, simple_loss=0.3235, pruned_loss=0.0991, over 21338.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3339, pruned_loss=0.1012, over 4280543.90 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:53:41,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=468714.0, ans=0.125 2023-06-20 02:53:50,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=468774.0, ans=0.0 2023-06-20 02:54:11,172 INFO [train.py:996] (0/4) Epoch 3, batch 17150, loss[loss=0.2503, simple_loss=0.3027, pruned_loss=0.09897, over 21576.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3289, pruned_loss=0.1005, over 4291917.51 frames. ], batch size: 548, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:54:36,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. 
limit=10.0 2023-06-20 02:55:03,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=468894.0, ans=0.125 2023-06-20 02:55:23,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=468954.0, ans=0.125 2023-06-20 02:55:48,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=469014.0, ans=0.125 2023-06-20 02:55:59,113 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.624e+02 2.921e+02 3.624e+02 5.783e+02, threshold=5.842e+02, percent-clipped=0.0 2023-06-20 02:56:28,762 INFO [train.py:996] (0/4) Epoch 3, batch 17200, loss[loss=0.2757, simple_loss=0.3404, pruned_loss=0.1055, over 19981.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3293, pruned_loss=0.1002, over 4287782.64 frames. ], batch size: 702, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:56:30,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=469134.0, ans=0.0 2023-06-20 02:56:38,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=469134.0, ans=0.2 2023-06-20 02:57:06,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.39 vs. limit=15.0 2023-06-20 02:57:07,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=469194.0, ans=0.125 2023-06-20 02:57:39,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=469314.0, ans=0.125 2023-06-20 02:58:07,489 INFO [train.py:996] (0/4) Epoch 3, batch 17250, loss[loss=0.2875, simple_loss=0.3596, pruned_loss=0.1077, over 21661.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.333, pruned_loss=0.1022, over 4280566.95 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-20 02:58:08,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=469434.0, ans=0.04949747468305833 2023-06-20 02:58:16,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.25 vs. limit=15.0 2023-06-20 02:59:04,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=469554.0, ans=0.1 2023-06-20 02:59:20,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=469614.0, ans=0.125 2023-06-20 02:59:28,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.86 vs. 
2023-06-20 02:59:32,664 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 2.790e+02 3.289e+02 4.103e+02 7.188e+02, threshold=6.577e+02, percent-clipped=6.0
2023-06-20 02:59:34,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=469674.0, ans=0.2
2023-06-20 02:59:43,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=469674.0, ans=0.02
2023-06-20 02:59:44,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=469674.0, ans=0.0
2023-06-20 02:59:56,842 INFO [train.py:996] (0/4) Epoch 3, batch 17300, loss[loss=0.2972, simple_loss=0.3565, pruned_loss=0.119, over 21816.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3414, pruned_loss=0.1058, over 4281062.71 frames. ], batch size: 124, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 03:00:25,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=469794.0, ans=0.2
2023-06-20 03:00:41,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=469854.0, ans=0.125
2023-06-20 03:00:41,487 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 03:01:17,213 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 03:01:24,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=469974.0, ans=0.0
2023-06-20 03:01:25,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=469974.0, ans=0.2
2023-06-20 03:01:30,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0
2023-06-20 03:01:37,010 INFO [train.py:996] (0/4) Epoch 3, batch 17350, loss[loss=0.2096, simple_loss=0.291, pruned_loss=0.06404, over 21724.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3435, pruned_loss=0.106, over 4286036.78 frames. ], batch size: 247, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 03:03:09,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.914e+02 3.379e+02 4.234e+02 7.402e+02, threshold=6.758e+02, percent-clipped=2.0
2023-06-20 03:03:15,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=15.0
2023-06-20 03:03:23,321 INFO [train.py:996] (0/4) Epoch 3, batch 17400, loss[loss=0.3288, simple_loss=0.3991, pruned_loss=0.1292, over 21456.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3403, pruned_loss=0.1025, over 4280500.51 frames. ], batch size: 471, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 03:04:05,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=470394.0, ans=0.0
2023-06-20 03:04:47,104 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0
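In the optim.py:471 lines, the five numbers after "grad-norm quartiles" are the min/25%/median/75%/max of recently observed gradient norms, and the printed threshold is, up to rounding, always Clipping_scale (2.0) times the logged median (e.g. 2 x 3.289e+02 = 6.578e+02 vs. the printed 6.577e+02); percent-clipped is the share of recent steps whose norm exceeded it. A sketch of that bookkeeping, where the window length and the use of torch.quantile are assumptions:

```python
# Median-based gradient clipping as suggested by the optim.py lines:
# threshold = clipping_scale * median(recent gradient norms).
# The window size (128) and exact statistics kept are assumptions.
from collections import deque
import torch

class GradNormClipper:
    def __init__(self, clipping_scale=2.0, window=128):
        self.scale = clipping_scale
        self.norms = deque(maxlen=window)

    def clip_(self, params):
        params = [p for p in params if p.grad is not None]
        norm = torch.cat([p.grad.detach().flatten() for p in params]).norm().item()
        self.norms.append(norm)
        hist = torch.tensor(list(self.norms))
        quartiles = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.scale * quartiles[2].item()   # 2.0 x median, as logged
        if norm > threshold:                           # counts toward percent-clipped
            for p in params:
                p.grad.mul_(threshold / norm)
        return quartiles, threshold
```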
2023-06-20 03:05:05,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=470514.0, ans=0.125
2023-06-20 03:05:25,725 INFO [train.py:996] (0/4) Epoch 3, batch 17450, loss[loss=0.2153, simple_loss=0.2847, pruned_loss=0.07295, over 21248.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.334, pruned_loss=0.09847, over 4276566.21 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 03:06:28,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=470754.0, ans=10.0
2023-06-20 03:06:47,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.738e+02 2.479e+02 3.095e+02 3.749e+02 7.284e+02, threshold=6.191e+02, percent-clipped=2.0
2023-06-20 03:06:57,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=470874.0, ans=0.0
2023-06-20 03:07:04,548 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0
2023-06-20 03:07:06,290 INFO [train.py:996] (0/4) Epoch 3, batch 17500, loss[loss=0.2738, simple_loss=0.3296, pruned_loss=0.109, over 21853.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3281, pruned_loss=0.09491, over 4275401.74 frames. ], batch size: 414, lr: 1.08e-02, grad_scale: 32.0
2023-06-20 03:07:24,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=470934.0, ans=0.125
2023-06-20 03:07:47,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=471054.0, ans=0.05
2023-06-20 03:07:55,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=471054.0, ans=0.0
2023-06-20 03:07:57,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471054.0, ans=0.1
2023-06-20 03:08:41,870 INFO [train.py:996] (0/4) Epoch 3, batch 17550, loss[loss=0.2361, simple_loss=0.318, pruned_loss=0.07714, over 21759.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3269, pruned_loss=0.09287, over 4258023.03 frames. ], batch size: 112, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:08:51,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=471234.0, ans=0.0
2023-06-20 03:09:14,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=471294.0, ans=0.125
2023-06-20 03:09:41,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=471414.0, ans=0.125
2023-06-20 03:09:41,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=471414.0, ans=0.125
2023-06-20 03:09:47,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=471414.0, ans=0.125
2023-06-20 03:09:54,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 2.408e+02 2.804e+02 3.227e+02 5.442e+02, threshold=5.608e+02, percent-clipped=0.0
2023-06-20 03:10:14,346 INFO [train.py:996] (0/4) Epoch 3, batch 17600, loss[loss=0.2948, simple_loss=0.3637, pruned_loss=0.1129, over 21607.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3307, pruned_loss=0.0943, over 4267805.90 frames. ], batch size: 389, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:11:11,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=471654.0, ans=0.125
2023-06-20 03:11:46,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=471774.0, ans=0.125
2023-06-20 03:11:55,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=471774.0, ans=0.2
2023-06-20 03:12:01,254 INFO [train.py:996] (0/4) Epoch 3, batch 17650, loss[loss=0.2879, simple_loss=0.3514, pruned_loss=0.1122, over 21514.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3298, pruned_loss=0.09483, over 4259349.02 frames. ], batch size: 509, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:13:35,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.510e+02 2.877e+02 3.404e+02 6.188e+02, threshold=5.753e+02, percent-clipped=2.0
2023-06-20 03:13:36,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=472074.0, ans=0.125
2023-06-20 03:13:38,625 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0
2023-06-20 03:13:47,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=472074.0, ans=0.125
2023-06-20 03:13:54,786 INFO [train.py:996] (0/4) Epoch 3, batch 17700, loss[loss=0.305, simple_loss=0.3749, pruned_loss=0.1176, over 21571.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3233, pruned_loss=0.09172, over 4256678.14 frames. ], batch size: 414, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:14:15,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472134.0, ans=0.1
2023-06-20 03:15:38,917 INFO [train.py:996] (0/4) Epoch 3, batch 17750, loss[loss=0.2867, simple_loss=0.3614, pruned_loss=0.106, over 21983.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3311, pruned_loss=0.09574, over 4264115.92 frames. ], batch size: 317, lr: 1.07e-02, grad_scale: 32.0
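One consistency worth noting across every train.py:996 line in this stretch: the headline loss is always pruned_loss plus half of simple_loss, i.e. the pruned-RNN-T objective with a 0.5 weight on the auxiliary simple loss. This can be checked directly against the logged tot_loss numbers:

```python
# Check loss = 0.5 * simple_loss + pruned_loss on tot_loss values from this log.
records = [
    (0.2672, 0.3347, 0.09986),  # fragment at the top of this excerpt
    (0.265,  0.3289, 0.1005),   # batch 17150
    (0.2649, 0.3293, 0.1002),   # batch 17200
    (0.2596, 0.3307, 0.0943),   # batch 17600
]
for loss, simple, pruned in records:
    assert abs(0.5 * simple + pruned - loss) < 2e-3, (loss, simple, pruned)
print("tot_loss == 0.5 * simple_loss + pruned_loss (to logging precision)")
```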
2023-06-20 03:15:57,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=472434.0, ans=0.125
2023-06-20 03:16:06,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=472494.0, ans=0.05
2023-06-20 03:16:46,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=472614.0, ans=0.0
2023-06-20 03:17:02,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=472674.0, ans=0.125
2023-06-20 03:17:03,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.594e+02 2.990e+02 3.478e+02 6.591e+02, threshold=5.980e+02, percent-clipped=3.0
2023-06-20 03:17:15,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=472674.0, ans=0.125
2023-06-20 03:17:27,835 INFO [train.py:996] (0/4) Epoch 3, batch 17800, loss[loss=0.2463, simple_loss=0.322, pruned_loss=0.08533, over 21723.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3293, pruned_loss=0.09417, over 4251934.38 frames. ], batch size: 332, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:17:58,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=472854.0, ans=0.07
2023-06-20 03:18:05,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0
2023-06-20 03:19:05,727 INFO [train.py:996] (0/4) Epoch 3, batch 17850, loss[loss=0.2866, simple_loss=0.3545, pruned_loss=0.1094, over 20710.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3314, pruned_loss=0.09535, over 4258866.68 frames. ], batch size: 607, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:19:24,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=473094.0, ans=0.0
2023-06-20 03:20:28,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=473214.0, ans=0.0
2023-06-20 03:20:42,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.814e+02 3.255e+02 4.435e+02 8.070e+02, threshold=6.511e+02, percent-clipped=5.0
2023-06-20 03:20:47,757 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 03:20:55,849 INFO [train.py:996] (0/4) Epoch 3, batch 17900, loss[loss=0.2378, simple_loss=0.3169, pruned_loss=0.07939, over 21362.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3371, pruned_loss=0.09806, over 4261412.77 frames. ], batch size: 131, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:21:03,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473334.0, ans=0.1
2023-06-20 03:21:36,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=473394.0, ans=0.05
2023-06-20 03:21:59,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0
2023-06-20 03:22:07,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=473454.0, ans=0.2
2023-06-20 03:22:24,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0
2023-06-20 03:22:43,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473574.0, ans=0.1
2023-06-20 03:22:53,611 INFO [train.py:996] (0/4) Epoch 3, batch 17950, loss[loss=0.2402, simple_loss=0.3323, pruned_loss=0.07401, over 21634.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3356, pruned_loss=0.09366, over 4258644.14 frames. ], batch size: 389, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:22:55,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=473634.0, ans=0.125
2023-06-20 03:23:14,940 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5
2023-06-20 03:23:45,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5
2023-06-20 03:24:29,839 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 2.407e+02 3.130e+02 3.720e+02 6.419e+02, threshold=6.259e+02, percent-clipped=0.0
2023-06-20 03:24:42,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=473934.0, ans=0.0
2023-06-20 03:24:43,119 INFO [train.py:996] (0/4) Epoch 3, batch 18000, loss[loss=0.2099, simple_loss=0.2968, pruned_loss=0.06145, over 20775.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3288, pruned_loss=0.09173, over 4260086.67 frames. ], batch size: 607, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:24:43,120 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-20 03:25:44,236 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2752, simple_loss=0.3767, pruned_loss=0.08679, over 1796401.00 frames.
2023-06-20 03:25:44,237 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB
2023-06-20 03:25:51,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=473934.0, ans=10.0
2023-06-20 03:26:15,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=473994.0, ans=0.0
2023-06-20 03:26:23,539 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=15.0
2023-06-20 03:26:43,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=474054.0, ans=0.125
2023-06-20 03:26:55,736 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=8.0
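The train.py:1019/1028/1029 triple above is the periodic validation pass: training pauses at batch 18000, the loss is averaged over a fixed dev set (the same 1796401.00 frames every time), and peak CUDA memory is reported. A minimal version of that loop, with model, dev_loader and compute_loss as hypothetical stand-ins for the real train.py objects:

```python
# Minimal periodic-validation sketch; `model`, `dev_loader`, `compute_loss`
# are placeholders, not the actual train.py names.
import torch

def validate(model, dev_loader, compute_loss, device="cuda:0"):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss_sum, num_frames = compute_loss(model, batch)  # summed over batch
            tot_loss += loss_sum.item()
            tot_frames += num_frames
    model.train()
    print(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames.")
    mb = torch.cuda.max_memory_allocated(device) // 2**20
    print(f"Maximum memory allocated so far is {mb}MB")
```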
2023-06-20 03:27:07,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=474174.0, ans=10.0
2023-06-20 03:27:07,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=474174.0, ans=0.125
2023-06-20 03:27:25,087 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.41 vs. limit=10.0
2023-06-20 03:27:26,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=474234.0, ans=0.125
2023-06-20 03:27:26,968 INFO [train.py:996] (0/4) Epoch 3, batch 18050, loss[loss=0.2233, simple_loss=0.2854, pruned_loss=0.0806, over 21540.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3227, pruned_loss=0.09, over 4257710.24 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:28:45,785 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.808e+02 3.337e+02 3.938e+02 6.336e+02, threshold=6.674e+02, percent-clipped=1.0
2023-06-20 03:29:05,296 INFO [train.py:996] (0/4) Epoch 3, batch 18100, loss[loss=0.2573, simple_loss=0.3514, pruned_loss=0.08158, over 21825.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3275, pruned_loss=0.09238, over 4263500.42 frames. ], batch size: 372, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:29:40,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=474594.0, ans=15.0
2023-06-20 03:30:38,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=474774.0, ans=0.0
2023-06-20 03:30:43,528 INFO [train.py:996] (0/4) Epoch 3, batch 18150, loss[loss=0.2549, simple_loss=0.3426, pruned_loss=0.08363, over 20706.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3303, pruned_loss=0.09326, over 4257074.71 frames. ], batch size: 607, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:31:26,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=474894.0, ans=0.125
2023-06-20 03:32:04,227 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.533e+02 2.904e+02 3.319e+02 6.906e+02, threshold=5.807e+02, percent-clipped=1.0
2023-06-20 03:32:04,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=475074.0, ans=0.1
2023-06-20 03:32:29,456 INFO [train.py:996] (0/4) Epoch 3, batch 18200, loss[loss=0.2284, simple_loss=0.2866, pruned_loss=0.08511, over 21767.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3241, pruned_loss=0.09238, over 4253496.36 frames. ], batch size: 300, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:32:38,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=475134.0, ans=0.0
2023-06-20 03:33:14,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=475254.0, ans=0.0
2023-06-20 03:33:31,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=475314.0, ans=0.035
2023-06-20 03:33:34,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=475314.0, ans=0.125
2023-06-20 03:33:46,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=475374.0, ans=0.125
2023-06-20 03:33:49,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=475374.0, ans=0.125
2023-06-20 03:34:01,005 INFO [train.py:996] (0/4) Epoch 3, batch 18250, loss[loss=0.3147, simple_loss=0.3796, pruned_loss=0.1249, over 19982.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3154, pruned_loss=0.08887, over 4252389.75 frames. ], batch size: 702, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:34:03,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0
2023-06-20 03:35:15,223 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0
2023-06-20 03:35:24,974 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 2.363e+02 2.913e+02 3.587e+02 8.946e+02, threshold=5.827e+02, percent-clipped=7.0
2023-06-20 03:35:38,295 INFO [train.py:996] (0/4) Epoch 3, batch 18300, loss[loss=0.2261, simple_loss=0.2947, pruned_loss=0.07873, over 21848.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3161, pruned_loss=0.0892, over 4262407.87 frames. ], batch size: 118, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:36:29,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=22.5
2023-06-20 03:36:56,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=475914.0, ans=10.0
2023-06-20 03:36:59,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=475974.0, ans=0.125
2023-06-20 03:37:14,087 INFO [train.py:996] (0/4) Epoch 3, batch 18350, loss[loss=0.2622, simple_loss=0.3322, pruned_loss=0.09608, over 21597.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3205, pruned_loss=0.08906, over 4247939.42 frames. ], batch size: 414, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:38:10,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0
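The "batch size" field in these summaries swings between about 60 and 700 because batches are packed to a total-duration budget rather than to a fixed utterance count: many short cuts or a few long ones per batch. An illustrative packing rule (not the real bucketing sampler; the 900-second budget is just an example figure):

```python
# Illustrative duration-budgeted batching: batch size varies inversely with
# utterance length under a fixed seconds-per-batch budget. Not the real sampler.
def duration_batches(durations, max_duration):
    batch, used = [], 0.0
    for d in durations:
        if batch and used + d > max_duration:
            yield batch
            batch, used = [], 0.0
        batch.append(d)
        used += d
    if batch:
        yield batch

print(len(next(duration_batches([1.3] * 2000, 900.0))))  # 692 short cuts
print(len(next(duration_batches([12.0] * 200, 900.0))))  # 75 long cuts
```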
2023-06-20 03:38:37,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=476214.0, ans=0.2
2023-06-20 03:38:44,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.546e+02 3.125e+02 3.968e+02 7.184e+02, threshold=6.249e+02, percent-clipped=4.0
2023-06-20 03:38:58,461 INFO [train.py:996] (0/4) Epoch 3, batch 18400, loss[loss=0.1933, simple_loss=0.2733, pruned_loss=0.05662, over 21285.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3154, pruned_loss=0.08794, over 4252694.72 frames. ], batch size: 176, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:39:44,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476454.0, ans=0.1
2023-06-20 03:39:46,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=476454.0, ans=0.125
2023-06-20 03:40:04,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=476514.0, ans=22.5
2023-06-20 03:40:05,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=476514.0, ans=0.0
2023-06-20 03:40:35,479 INFO [train.py:996] (0/4) Epoch 3, batch 18450, loss[loss=0.2173, simple_loss=0.2984, pruned_loss=0.06815, over 21896.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.31, pruned_loss=0.08315, over 4245659.71 frames. ], batch size: 373, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:40:50,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=476634.0, ans=0.0
2023-06-20 03:41:20,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476754.0, ans=0.1
2023-06-20 03:41:57,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 2.233e+02 2.788e+02 3.654e+02 6.686e+02, threshold=5.575e+02, percent-clipped=3.0
2023-06-20 03:42:20,996 INFO [train.py:996] (0/4) Epoch 3, batch 18500, loss[loss=0.2294, simple_loss=0.2805, pruned_loss=0.08911, over 21398.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3052, pruned_loss=0.08246, over 4242155.19 frames. ], batch size: 144, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:42:48,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476994.0, ans=0.1
2023-06-20 03:43:31,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=477054.0, ans=0.0
2023-06-20 03:43:54,893 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.67 vs. limit=10.0
2023-06-20 03:44:04,051 INFO [train.py:996] (0/4) Epoch 3, batch 18550, loss[loss=0.2388, simple_loss=0.2899, pruned_loss=0.09391, over 21313.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3046, pruned_loss=0.08199, over 4242396.00 frames. ], batch size: 160, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:44:04,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=477234.0, ans=0.125
2023-06-20 03:44:11,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=477234.0, ans=0.2
2023-06-20 03:44:26,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0
2023-06-20 03:44:49,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=477294.0, ans=0.125
2023-06-20 03:45:14,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=477354.0, ans=0.0
2023-06-20 03:45:21,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0
2023-06-20 03:45:38,895 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.443e+02 2.739e+02 3.357e+02 5.521e+02, threshold=5.479e+02, percent-clipped=0.0
2023-06-20 03:45:51,270 INFO [train.py:996] (0/4) Epoch 3, batch 18600, loss[loss=0.2028, simple_loss=0.2719, pruned_loss=0.0669, over 21812.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3046, pruned_loss=0.08349, over 4240938.08 frames. ], batch size: 118, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:46:08,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477534.0, ans=0.1
2023-06-20 03:46:51,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=477654.0, ans=0.025
2023-06-20 03:47:11,208 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0
2023-06-20 03:47:33,517 INFO [train.py:996] (0/4) Epoch 3, batch 18650, loss[loss=0.2423, simple_loss=0.2848, pruned_loss=0.09991, over 20337.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3043, pruned_loss=0.08373, over 4251927.96 frames. ], batch size: 703, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:47:35,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=477834.0, ans=0.09899494936611666
2023-06-20 03:47:45,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=477834.0, ans=0.2
2023-06-20 03:48:37,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=478014.0, ans=0.0
2023-06-20 03:48:51,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.448e+02 2.806e+02 3.441e+02 5.689e+02, threshold=5.611e+02, percent-clipped=1.0
2023-06-20 03:49:03,415 INFO [train.py:996] (0/4) Epoch 3, batch 18700, loss[loss=0.2388, simple_loss=0.2986, pruned_loss=0.08943, over 21803.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3025, pruned_loss=0.08592, over 4250535.72 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:50:05,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=478254.0, ans=0.0
2023-06-20 03:50:35,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=478374.0, ans=0.125
2023-06-20 03:50:39,453 INFO [train.py:996] (0/4) Epoch 3, batch 18750, loss[loss=0.3445, simple_loss=0.3988, pruned_loss=0.1451, over 21643.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3062, pruned_loss=0.08963, over 4265233.70 frames. ], batch size: 414, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:52:11,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.724e+02 3.056e+02 3.518e+02 6.406e+02, threshold=6.113e+02, percent-clipped=0.0
2023-06-20 03:52:24,556 INFO [train.py:996] (0/4) Epoch 3, batch 18800, loss[loss=0.1988, simple_loss=0.2787, pruned_loss=0.05952, over 21581.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3123, pruned_loss=0.09105, over 4263367.21 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:53:24,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=478854.0, ans=0.0
2023-06-20 03:53:25,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.98 vs. limit=22.5
2023-06-20 03:53:37,392 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0
2023-06-20 03:54:08,163 INFO [train.py:996] (0/4) Epoch 3, batch 18850, loss[loss=0.2323, simple_loss=0.2945, pruned_loss=0.08499, over 21743.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3082, pruned_loss=0.08524, over 4264174.99 frames. ], batch size: 316, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:54:08,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=479034.0, ans=0.125
2023-06-20 03:54:17,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=479034.0, ans=0.0
2023-06-20 03:54:38,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.19 vs. limit=5.0
2023-06-20 03:55:26,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=479214.0, ans=0.125
2023-06-20 03:55:38,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=479274.0, ans=0.125
2023-06-20 03:55:41,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 2.197e+02 2.509e+02 3.227e+02 5.628e+02, threshold=5.018e+02, percent-clipped=1.0
2023-06-20 03:56:10,928 INFO [train.py:996] (0/4) Epoch 3, batch 18900, loss[loss=0.2444, simple_loss=0.2931, pruned_loss=0.09786, over 21566.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3058, pruned_loss=0.08592, over 4266733.03 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:57:04,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=479454.0, ans=0.0
2023-06-20 03:57:31,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=479574.0, ans=0.0
2023-06-20 03:57:57,226 INFO [train.py:996] (0/4) Epoch 3, batch 18950, loss[loss=0.2456, simple_loss=0.3035, pruned_loss=0.09387, over 21689.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3066, pruned_loss=0.08797, over 4272884.58 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 03:58:02,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=479634.0, ans=0.0
2023-06-20 03:59:30,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=479874.0, ans=0.0
2023-06-20 03:59:36,605 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.641e+02 3.073e+02 3.851e+02 7.117e+02, threshold=6.146e+02, percent-clipped=12.0
2023-06-20 03:59:48,125 INFO [train.py:996] (0/4) Epoch 3, batch 19000, loss[loss=0.2851, simple_loss=0.3493, pruned_loss=0.1105, over 21775.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3163, pruned_loss=0.08931, over 4272311.56 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 32.0
2023-06-20 04:00:05,690 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:00:13,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0
2023-06-20 04:00:20,332 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-80000.pt
2023-06-20 04:00:26,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=479994.0, ans=0.125
2023-06-20 04:00:27,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=479994.0, ans=0.0
2023-06-20 04:00:41,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=480054.0, ans=0.125
2023-06-20 04:01:35,657 INFO [train.py:996] (0/4) Epoch 3, batch 19050, loss[loss=0.2529, simple_loss=0.318, pruned_loss=0.09388, over 21959.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3222, pruned_loss=0.09361, over 4278451.21 frames. ], batch size: 113, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:01:38,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=480234.0, ans=0.2
2023-06-20 04:02:21,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.19 vs. limit=10.0
2023-06-20 04:02:30,779 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0
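The checkpoint.py:75 line above is a batch-indexed save (checkpoint-80000.pt under zipformer/exp_L_small) taken mid-epoch rather than at an epoch boundary. A sketch of that kind of periodic save; exactly which objects belong in the saved dict is an assumption here:

```python
# Periodic batch-indexed checkpoint sketch; the dict contents (sampler state,
# grad scaler, ...) are assumptions, not verified against checkpoint.py.
from pathlib import Path
import torch

def save_checkpoint(exp_dir, batch_idx_train, model, optimizer, scheduler=None):
    ckpt = {
        "batch_idx_train": batch_idx_train,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": None if scheduler is None else scheduler.state_dict(),
    }
    path = Path(exp_dir) / f"checkpoint-{batch_idx_train}.pt"
    torch.save(ckpt, path)  # e.g. zipformer/exp_L_small/checkpoint-80000.pt
```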
2023-06-20 04:02:36,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=480414.0, ans=0.0
2023-06-20 04:03:08,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.699e+02 3.066e+02 3.577e+02 5.949e+02, threshold=6.132e+02, percent-clipped=0.0
2023-06-20 04:03:10,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=480474.0, ans=0.2
2023-06-20 04:03:14,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=22.5
2023-06-20 04:03:36,284 INFO [train.py:996] (0/4) Epoch 3, batch 19100, loss[loss=0.2171, simple_loss=0.2737, pruned_loss=0.08029, over 21345.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3203, pruned_loss=0.09431, over 4278927.88 frames. ], batch size: 211, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:03:47,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=480534.0, ans=0.125
2023-06-20 04:03:48,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480534.0, ans=0.1
2023-06-20 04:03:55,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=480534.0, ans=0.125
2023-06-20 04:04:05,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480594.0, ans=0.1
2023-06-20 04:05:09,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0
2023-06-20 04:05:10,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=480774.0, ans=0.0
2023-06-20 04:05:14,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=480774.0, ans=0.2
2023-06-20 04:05:39,991 INFO [train.py:996] (0/4) Epoch 3, batch 19150, loss[loss=0.2474, simple_loss=0.334, pruned_loss=0.08034, over 21426.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3225, pruned_loss=0.09611, over 4280278.53 frames. ], batch size: 211, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:06:00,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=480894.0, ans=0.0
2023-06-20 04:06:09,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=480894.0, ans=0.0
2023-06-20 04:07:17,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.726e+02 3.068e+02 3.801e+02 8.063e+02, threshold=6.136e+02, percent-clipped=5.0
2023-06-20 04:07:30,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481074.0, ans=0.1
2023-06-20 04:07:35,522 INFO [train.py:996] (0/4) Epoch 3, batch 19200, loss[loss=0.3242, simple_loss=0.4126, pruned_loss=0.1179, over 21626.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3316, pruned_loss=0.0959, over 4276783.53 frames. ], batch size: 441, lr: 1.06e-02, grad_scale: 32.0
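The lr field decays smoothly through this stretch (1.08e-02 near batch 17100, 1.06e-02 by batch 19200) rather than in scheduled steps, which is consistent with a power-law decay driven by both the batch index and the epoch. The Eden-style rule below is an assumed shape for such a schedule; the exponents and reference constants are illustrative and not a verified reproduction of the logged values.

```python
# Assumed Eden-style schedule: smooth power-law decay in both batch and epoch.
# Constants and exponents are illustrative, not extracted from optim.py.
def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.5):
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# decays smoothly, with no step changes:
assert eden_lr(0.045, 17100, 2.9) > eden_lr(0.045, 19200, 3.0)
```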
2023-06-20 04:09:03,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=481374.0, ans=0.0
2023-06-20 04:09:14,903 INFO [train.py:996] (0/4) Epoch 3, batch 19250, loss[loss=0.2155, simple_loss=0.2969, pruned_loss=0.06703, over 21641.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3286, pruned_loss=0.08989, over 4265130.91 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:09:24,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=481434.0, ans=0.125
2023-06-20 04:09:47,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=481554.0, ans=0.125
2023-06-20 04:09:50,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=481554.0, ans=0.0
2023-06-20 04:10:06,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=481614.0, ans=0.0
2023-06-20 04:10:13,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0
2023-06-20 04:10:28,329 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 2.035e+02 2.594e+02 2.988e+02 5.303e+02, threshold=5.187e+02, percent-clipped=0.0
2023-06-20 04:10:49,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=481734.0, ans=0.0
2023-06-20 04:10:50,954 INFO [train.py:996] (0/4) Epoch 3, batch 19300, loss[loss=0.2155, simple_loss=0.2841, pruned_loss=0.07347, over 21408.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3255, pruned_loss=0.08993, over 4273795.75 frames. ], batch size: 131, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:11:39,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=481854.0, ans=0.0
2023-06-20 04:12:18,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=481974.0, ans=0.2
2023-06-20 04:12:29,534 INFO [train.py:996] (0/4) Epoch 3, batch 19350, loss[loss=0.1872, simple_loss=0.2665, pruned_loss=0.05396, over 21579.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3198, pruned_loss=0.08564, over 4268882.89 frames. ], batch size: 195, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:13:13,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=482154.0, ans=10.0
2023-06-20 04:13:39,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=482274.0, ans=6.0
2023-06-20 04:13:46,941 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.454e+02 2.758e+02 3.131e+02 4.952e+02, threshold=5.516e+02, percent-clipped=0.0
2023-06-20 04:14:04,530 INFO [train.py:996] (0/4) Epoch 3, batch 19400, loss[loss=0.2758, simple_loss=0.345, pruned_loss=0.1034, over 21846.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3155, pruned_loss=0.08343, over 4272647.90 frames. ], batch size: 118, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:14:19,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=482394.0, ans=0.0
2023-06-20 04:14:53,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=22.5
2023-06-20 04:15:52,992 INFO [train.py:996] (0/4) Epoch 3, batch 19450, loss[loss=0.2171, simple_loss=0.2761, pruned_loss=0.07911, over 21498.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3147, pruned_loss=0.0865, over 4280756.83 frames. ], batch size: 212, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:16:02,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=482634.0, ans=0.125
2023-06-20 04:16:18,517 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0
2023-06-20 04:17:12,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.711e+02 3.172e+02 3.955e+02 6.078e+02, threshold=6.345e+02, percent-clipped=2.0
2023-06-20 04:17:30,459 INFO [train.py:996] (0/4) Epoch 3, batch 19500, loss[loss=0.243, simple_loss=0.3143, pruned_loss=0.08585, over 21808.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3103, pruned_loss=0.08784, over 4284643.14 frames. ], batch size: 372, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:17:40,049 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:17:44,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=482994.0, ans=0.0
2023-06-20 04:17:47,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=482994.0, ans=0.2
2023-06-20 04:18:47,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0
2023-06-20 04:19:28,232 INFO [train.py:996] (0/4) Epoch 3, batch 19550, loss[loss=0.2168, simple_loss=0.302, pruned_loss=0.06583, over 21838.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3071, pruned_loss=0.08613, over 4282254.99 frames. ], batch size: 371, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:19:37,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=483234.0, ans=0.125
2023-06-20 04:20:00,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=483294.0, ans=0.2
2023-06-20 04:20:26,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=12.0
2023-06-20 04:20:47,582 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0
2023-06-20 04:21:03,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.587e+02 3.203e+02 3.916e+02 8.201e+02, threshold=6.407e+02, percent-clipped=3.0
2023-06-20 04:21:15,866 INFO [train.py:996] (0/4) Epoch 3, batch 19600, loss[loss=0.2774, simple_loss=0.3336, pruned_loss=0.1106, over 21764.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3107, pruned_loss=0.08829, over 4287566.43 frames. ], batch size: 389, lr: 1.06e-02, grad_scale: 32.0
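The grad_scale: 32.0 at the end of every summary is the loss-scaling factor of mixed-precision training: the loss is multiplied by it before backward so fp16 gradients do not underflow, gradients are unscaled before the optimizer step, and the factor is adjusted automatically. The standard PyTorch pattern, with generic model/optimizer/compute_loss placeholders:

```python
# Standard torch.cuda.amp loop; the logged grad_scale corresponds to
# scaler.get_scale(). `model`, `optimizer`, `compute_loss` are placeholders.
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=2.0**10)

def train_step(model, optimizer, compute_loss, batch):
    optimizer.zero_grad()
    with autocast():                  # mixed-precision forward
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()     # scaled backward avoids fp16 underflow
    scaler.step(optimizer)            # unscales grads; skips step on inf/nan
    scaler.update()                   # grows/shrinks the scale, e.g. to 32.0
    return loss.detach(), scaler.get_scale()
```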
2023-06-20 04:21:35,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=15.0
2023-06-20 04:22:19,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=483714.0, ans=0.07
2023-06-20 04:22:27,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=483714.0, ans=0.0
2023-06-20 04:22:42,864 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:22:53,023 INFO [train.py:996] (0/4) Epoch 3, batch 19650, loss[loss=0.2809, simple_loss=0.3394, pruned_loss=0.1112, over 21691.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3162, pruned_loss=0.09216, over 4287282.51 frames. ], batch size: 389, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:23:04,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=483834.0, ans=0.0
2023-06-20 04:23:41,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0
2023-06-20 04:24:35,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.259e+02 2.819e+02 3.213e+02 4.152e+02 6.048e+02, threshold=6.427e+02, percent-clipped=0.0
2023-06-20 04:24:38,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=484074.0, ans=0.2
2023-06-20 04:25:06,171 INFO [train.py:996] (0/4) Epoch 3, batch 19700, loss[loss=0.2245, simple_loss=0.2769, pruned_loss=0.08608, over 21217.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3206, pruned_loss=0.0934, over 4288422.29 frames. ], batch size: 143, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:25:29,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=484194.0, ans=0.0
2023-06-20 04:26:07,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=22.5
2023-06-20 04:26:45,406 INFO [train.py:996] (0/4) Epoch 3, batch 19750, loss[loss=0.2717, simple_loss=0.3513, pruned_loss=0.09604, over 21590.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3298, pruned_loss=0.09547, over 4281613.27 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:27:06,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=484494.0, ans=0.125
2023-06-20 04:27:33,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=484554.0, ans=0.125
2023-06-20 04:27:57,977 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0
2023-06-20 04:28:26,728 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.683e+02 3.196e+02 4.002e+02 6.809e+02, threshold=6.392e+02, percent-clipped=1.0
2023-06-20 04:28:38,973 INFO [train.py:996] (0/4) Epoch 3, batch 19800, loss[loss=0.1825, simple_loss=0.2181, pruned_loss=0.07343, over 16581.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3302, pruned_loss=0.09568, over 4282771.29 frames. ], batch size: 60, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:29:28,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=484854.0, ans=0.125
2023-06-20 04:29:35,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=484854.0, ans=0.0
2023-06-20 04:29:36,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=484854.0, ans=0.2
2023-06-20 04:29:43,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=484914.0, ans=0.1
2023-06-20 04:30:04,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=484974.0, ans=0.125
2023-06-20 04:30:10,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=484974.0, ans=0.125
2023-06-20 04:30:21,941 INFO [train.py:996] (0/4) Epoch 3, batch 19850, loss[loss=0.2053, simple_loss=0.292, pruned_loss=0.05931, over 21616.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3235, pruned_loss=0.09116, over 4283686.31 frames. ], batch size: 263, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:30:33,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=485034.0, ans=0.0
2023-06-20 04:31:03,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=485154.0, ans=0.0
2023-06-20 04:31:14,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=485154.0, ans=0.0
2023-06-20 04:31:18,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=485214.0, ans=0.125
2023-06-20 04:31:34,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=485214.0, ans=0.125
2023-06-20 04:31:36,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=485214.0, ans=0.125
2023-06-20 04:31:41,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 2.292e+02 2.688e+02 3.533e+02 4.969e+02, threshold=5.376e+02, percent-clipped=0.0
2023-06-20 04:31:43,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0
2023-06-20 04:31:49,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.62 vs. limit=6.0
2023-06-20 04:31:59,296 INFO [train.py:996] (0/4) Epoch 3, batch 19900, loss[loss=0.1705, simple_loss=0.2405, pruned_loss=0.0503, over 16401.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3207, pruned_loss=0.08673, over 4270936.93 frames. ], batch size: 60, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:32:40,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=485454.0, ans=0.125
2023-06-20 04:33:05,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=485514.0, ans=0.125
2023-06-20 04:33:08,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=485514.0, ans=0.0
2023-06-20 04:33:18,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=485574.0, ans=0.2
2023-06-20 04:33:39,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=22.5
2023-06-20 04:33:42,961 INFO [train.py:996] (0/4) Epoch 3, batch 19950, loss[loss=0.2147, simple_loss=0.2787, pruned_loss=0.07539, over 21681.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3145, pruned_loss=0.0858, over 4260449.10 frames. ], batch size: 333, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:33:45,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=485634.0, ans=0.125
2023-06-20 04:34:07,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=485694.0, ans=0.125
2023-06-20 04:34:10,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=485694.0, ans=0.125
2023-06-20 04:34:34,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=485754.0, ans=0.0
2023-06-20 04:35:02,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.533e+02 3.149e+02 3.836e+02 5.627e+02, threshold=6.299e+02, percent-clipped=1.0
2023-06-20 04:35:19,361 INFO [train.py:996] (0/4) Epoch 3, batch 20000, loss[loss=0.2747, simple_loss=0.3445, pruned_loss=0.1025, over 21806.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3151, pruned_loss=0.08645, over 4261832.92 frames. ], batch size: 414, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:35:45,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=485994.0, ans=0.125
2023-06-20 04:36:10,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=486054.0, ans=0.0
2023-06-20 04:36:21,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=486114.0, ans=0.0
2023-06-20 04:36:24,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=486114.0, ans=0.0
2023-06-20 04:36:37,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=486174.0, ans=10.0
2023-06-20 04:36:53,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=486234.0, ans=0.0
2023-06-20 04:36:54,200 INFO [train.py:996] (0/4) Epoch 3, batch 20050, loss[loss=0.2381, simple_loss=0.3051, pruned_loss=0.08559, over 21639.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3174, pruned_loss=0.08909, over 4273964.17 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:37:00,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=486234.0, ans=0.2
2023-06-20 04:37:29,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=486354.0, ans=0.125
2023-06-20 04:37:29,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=486354.0, ans=0.07
2023-06-20 04:37:31,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=486354.0, ans=0.125
2023-06-20 04:37:33,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0
2023-06-20 04:37:41,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486354.0, ans=0.1
2023-06-20 04:38:39,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.657e+02 3.135e+02 3.679e+02 6.652e+02, threshold=6.270e+02, percent-clipped=1.0
2023-06-20 04:38:43,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=486474.0, ans=0.125
2023-06-20 04:38:51,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=486534.0, ans=0.125
2023-06-20 04:38:52,020 INFO [train.py:996] (0/4) Epoch 3, batch 20100, loss[loss=0.2715, simple_loss=0.3556, pruned_loss=0.09372, over 21075.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3207, pruned_loss=0.09185, over 4279960.30 frames. ], batch size: 607, lr: 1.06e-02, grad_scale: 32.0
2023-06-20 04:39:00,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486534.0, ans=0.1
2023-06-20 04:39:07,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=486534.0, ans=0.2
2023-06-20 04:39:36,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=486654.0, ans=0.05
2023-06-20 04:39:43,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=486654.0, ans=0.125
2023-06-20 04:40:10,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=486714.0, ans=0.025
2023-06-20 04:40:17,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0
2023-06-20 04:40:31,378 INFO [train.py:996] (0/4) Epoch 3, batch 20150, loss[loss=0.3009, simple_loss=0.3614, pruned_loss=0.1202, over 21466.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3308, pruned_loss=0.09615, over 4276392.58 frames. ], batch size: 194, lr: 1.06e-02, grad_scale: 32.0
], batch size: 194, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:40:42,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=486834.0, ans=0.125 2023-06-20 04:40:42,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486834.0, ans=0.1 2023-06-20 04:41:38,884 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.62 vs. limit=22.5 2023-06-20 04:41:46,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=487014.0, ans=0.125 2023-06-20 04:41:51,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=487014.0, ans=0.0 2023-06-20 04:42:17,058 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.160e+02 3.625e+02 4.244e+02 7.181e+02, threshold=7.250e+02, percent-clipped=1.0 2023-06-20 04:42:37,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=487074.0, ans=0.2 2023-06-20 04:42:40,025 INFO [train.py:996] (0/4) Epoch 3, batch 20200, loss[loss=0.3754, simple_loss=0.4441, pruned_loss=0.1533, over 21442.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3373, pruned_loss=0.09953, over 4274490.39 frames. ], batch size: 507, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:42:57,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=487194.0, ans=0.0 2023-06-20 04:43:22,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=487254.0, ans=0.05 2023-06-20 04:44:05,945 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:44:29,140 INFO [train.py:996] (0/4) Epoch 3, batch 20250, loss[loss=0.2453, simple_loss=0.3237, pruned_loss=0.08347, over 21844.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3373, pruned_loss=0.09699, over 4281210.36 frames. ], batch size: 316, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:45:05,713 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:45:09,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=487554.0, ans=0.125 2023-06-20 04:45:16,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=487554.0, ans=0.125 2023-06-20 04:45:55,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=487614.0, ans=0.04949747468305833 2023-06-20 04:45:57,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-20 04:46:07,807 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.337e+02 2.707e+02 3.464e+02 5.836e+02, threshold=5.414e+02, percent-clipped=0.0 2023-06-20 04:46:19,806 INFO [train.py:996] (0/4) Epoch 3, batch 20300, loss[loss=0.2269, simple_loss=0.3074, pruned_loss=0.07318, over 21552.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3339, pruned_loss=0.09377, over 4275179.70 frames. 
], batch size: 212, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:47:10,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=487854.0, ans=0.2 2023-06-20 04:47:30,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=487914.0, ans=0.2 2023-06-20 04:47:36,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=487974.0, ans=15.0 2023-06-20 04:47:56,604 INFO [train.py:996] (0/4) Epoch 3, batch 20350, loss[loss=0.2539, simple_loss=0.3202, pruned_loss=0.09375, over 21899.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3335, pruned_loss=0.09375, over 4268050.30 frames. ], batch size: 118, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:48:10,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488094.0, ans=0.1 2023-06-20 04:48:54,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=488214.0, ans=0.05 2023-06-20 04:49:06,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-20 04:49:21,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488274.0, ans=0.1 2023-06-20 04:49:28,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.591e+02 2.979e+02 3.665e+02 6.122e+02, threshold=5.958e+02, percent-clipped=2.0 2023-06-20 04:49:41,478 INFO [train.py:996] (0/4) Epoch 3, batch 20400, loss[loss=0.2948, simple_loss=0.362, pruned_loss=0.1138, over 20782.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3365, pruned_loss=0.0972, over 4264264.67 frames. ], batch size: 608, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:49:41,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=488334.0, ans=0.125 2023-06-20 04:50:12,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-20 04:50:15,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-20 04:50:53,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488454.0, ans=0.1 2023-06-20 04:51:07,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=488514.0, ans=0.025 2023-06-20 04:51:46,586 INFO [train.py:996] (0/4) Epoch 3, batch 20450, loss[loss=0.2173, simple_loss=0.2531, pruned_loss=0.09076, over 20099.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3385, pruned_loss=0.1004, over 4263735.25 frames. 
], batch size: 703, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:52:01,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=488694.0, ans=0.125 2023-06-20 04:52:13,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=488694.0, ans=0.0 2023-06-20 04:52:48,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=488814.0, ans=0.125 2023-06-20 04:53:14,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 2.891e+02 3.490e+02 4.118e+02 8.011e+02, threshold=6.980e+02, percent-clipped=6.0 2023-06-20 04:53:16,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=488874.0, ans=0.1 2023-06-20 04:53:24,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=488874.0, ans=0.2 2023-06-20 04:53:26,652 INFO [train.py:996] (0/4) Epoch 3, batch 20500, loss[loss=0.263, simple_loss=0.3182, pruned_loss=0.1039, over 21869.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3338, pruned_loss=0.1003, over 4257195.81 frames. ], batch size: 371, lr: 1.06e-02, grad_scale: 32.0 2023-06-20 04:53:39,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=488994.0, ans=0.125 2023-06-20 04:54:05,260 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=12.0 2023-06-20 04:54:07,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=489054.0, ans=0.125 2023-06-20 04:54:43,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=489114.0, ans=0.2 2023-06-20 04:55:15,471 INFO [train.py:996] (0/4) Epoch 3, batch 20550, loss[loss=0.238, simple_loss=0.2951, pruned_loss=0.09045, over 21867.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3264, pruned_loss=0.09879, over 4262990.07 frames. ], batch size: 107, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 04:55:19,253 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=15.0 2023-06-20 04:55:20,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-20 04:56:08,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=489354.0, ans=0.0 2023-06-20 04:56:48,729 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.383e+02 2.780e+02 3.165e+02 5.319e+02, threshold=5.560e+02, percent-clipped=0.0 2023-06-20 04:56:59,055 INFO [train.py:996] (0/4) Epoch 3, batch 20600, loss[loss=0.2505, simple_loss=0.3173, pruned_loss=0.09187, over 21898.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3268, pruned_loss=0.09619, over 4260977.94 frames. 
], batch size: 316, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 04:57:05,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=489534.0, ans=0.125 2023-06-20 04:57:17,772 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-20 04:57:53,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=489654.0, ans=0.125 2023-06-20 04:58:32,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=15.0 2023-06-20 04:58:44,295 INFO [train.py:996] (0/4) Epoch 3, batch 20650, loss[loss=0.2219, simple_loss=0.2816, pruned_loss=0.08108, over 21174.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3239, pruned_loss=0.09694, over 4266493.63 frames. ], batch size: 159, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 04:59:37,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=489954.0, ans=0.125 2023-06-20 04:59:40,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=490014.0, ans=0.125 2023-06-20 04:59:49,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=490014.0, ans=0.1 2023-06-20 04:59:57,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=490014.0, ans=0.07 2023-06-20 05:00:03,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=490014.0, ans=0.125 2023-06-20 05:00:10,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 2.534e+02 3.079e+02 3.590e+02 6.191e+02, threshold=6.158e+02, percent-clipped=1.0 2023-06-20 05:00:14,193 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:00:21,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.13 vs. limit=22.5 2023-06-20 05:00:21,343 INFO [train.py:996] (0/4) Epoch 3, batch 20700, loss[loss=0.2283, simple_loss=0.2998, pruned_loss=0.07841, over 21694.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3146, pruned_loss=0.09216, over 4250868.59 frames. ], batch size: 298, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:00:32,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=490134.0, ans=0.125 2023-06-20 05:00:38,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=490194.0, ans=0.0 2023-06-20 05:01:35,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=490314.0, ans=0.1 2023-06-20 05:01:38,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=490314.0, ans=0.09899494936611666 2023-06-20 05:01:45,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.26 vs. 
limit=15.0 2023-06-20 05:01:52,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=490374.0, ans=0.125 2023-06-20 05:02:17,376 INFO [train.py:996] (0/4) Epoch 3, batch 20750, loss[loss=0.2647, simple_loss=0.359, pruned_loss=0.08524, over 21231.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3178, pruned_loss=0.09126, over 4255494.53 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:02:43,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-20 05:02:54,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=490494.0, ans=0.1 2023-06-20 05:03:56,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.880e+02 3.382e+02 4.040e+02 6.281e+02, threshold=6.763e+02, percent-clipped=1.0 2023-06-20 05:03:57,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=490674.0, ans=0.07 2023-06-20 05:04:10,635 INFO [train.py:996] (0/4) Epoch 3, batch 20800, loss[loss=0.2294, simple_loss=0.2803, pruned_loss=0.0892, over 21402.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.322, pruned_loss=0.09227, over 4257696.33 frames. ], batch size: 211, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:04:20,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=490734.0, ans=0.0 2023-06-20 05:04:32,627 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-20 05:05:14,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-20 05:05:18,132 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-20 05:05:43,070 INFO [train.py:996] (0/4) Epoch 3, batch 20850, loss[loss=0.2044, simple_loss=0.2706, pruned_loss=0.06909, over 21629.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3133, pruned_loss=0.08949, over 4266227.96 frames. ], batch size: 230, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:06:04,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=491034.0, ans=0.125 2023-06-20 05:06:33,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=491154.0, ans=0.2 2023-06-20 05:06:48,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=491214.0, ans=0.0 2023-06-20 05:07:05,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=491274.0, ans=0.125 2023-06-20 05:07:09,894 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.577e+02 3.141e+02 3.837e+02 5.556e+02, threshold=6.283e+02, percent-clipped=0.0 2023-06-20 05:07:25,945 INFO [train.py:996] (0/4) Epoch 3, batch 20900, loss[loss=0.2433, simple_loss=0.3166, pruned_loss=0.08495, over 21820.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.315, pruned_loss=0.09094, over 4265965.39 frames. 
], batch size: 124, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:07:26,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=491334.0, ans=0.125 2023-06-20 05:07:59,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=491394.0, ans=0.2 2023-06-20 05:08:00,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=491394.0, ans=0.125 2023-06-20 05:08:05,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-20 05:08:16,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=491454.0, ans=0.125 2023-06-20 05:08:36,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=491514.0, ans=0.2 2023-06-20 05:08:36,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=491514.0, ans=0.1 2023-06-20 05:08:55,356 INFO [train.py:996] (0/4) Epoch 3, batch 20950, loss[loss=0.1691, simple_loss=0.2376, pruned_loss=0.0503, over 17056.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3104, pruned_loss=0.08668, over 4261624.98 frames. ], batch size: 64, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:10:00,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=491814.0, ans=0.2 2023-06-20 05:10:14,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=491874.0, ans=0.0 2023-06-20 05:10:15,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=491874.0, ans=0.1 2023-06-20 05:10:19,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.460e+02 2.871e+02 3.245e+02 5.204e+02, threshold=5.741e+02, percent-clipped=0.0 2023-06-20 05:10:29,538 INFO [train.py:996] (0/4) Epoch 3, batch 21000, loss[loss=0.2567, simple_loss=0.3083, pruned_loss=0.1026, over 21576.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3086, pruned_loss=0.08694, over 4252658.95 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:10:29,540 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 05:11:23,168 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2766, simple_loss=0.3765, pruned_loss=0.08831, over 1796401.00 frames. 
2023-06-20 05:11:23,169 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-20 05:12:03,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=492054.0, ans=0.125 2023-06-20 05:12:22,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=492054.0, ans=0.0 2023-06-20 05:12:26,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=492114.0, ans=0.125 2023-06-20 05:12:33,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=492114.0, ans=22.5 2023-06-20 05:12:59,921 INFO [train.py:996] (0/4) Epoch 3, batch 21050, loss[loss=0.2447, simple_loss=0.3021, pruned_loss=0.09363, over 21732.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3068, pruned_loss=0.08755, over 4258514.66 frames. ], batch size: 351, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:13:04,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=492234.0, ans=0.0 2023-06-20 05:13:31,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=492294.0, ans=0.1 2023-06-20 05:13:35,918 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:14:02,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=492414.0, ans=0.125 2023-06-20 05:14:21,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.359e+02 2.713e+02 3.302e+02 4.914e+02, threshold=5.427e+02, percent-clipped=0.0 2023-06-20 05:14:30,221 INFO [train.py:996] (0/4) Epoch 3, batch 21100, loss[loss=0.2232, simple_loss=0.2756, pruned_loss=0.08536, over 21513.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3036, pruned_loss=0.08719, over 4256275.37 frames. ], batch size: 196, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:14:37,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=492534.0, ans=0.125 2023-06-20 05:14:40,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=492534.0, ans=0.125 2023-06-20 05:16:26,483 INFO [train.py:996] (0/4) Epoch 3, batch 21150, loss[loss=0.2548, simple_loss=0.2951, pruned_loss=0.1073, over 21329.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3007, pruned_loss=0.08811, over 4256352.13 frames. ], batch size: 473, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:17:48,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.584e+02 3.052e+02 3.730e+02 6.126e+02, threshold=6.104e+02, percent-clipped=6.0 2023-06-20 05:17:49,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493074.0, ans=0.1 2023-06-20 05:18:03,239 INFO [train.py:996] (0/4) Epoch 3, batch 21200, loss[loss=0.2152, simple_loss=0.28, pruned_loss=0.0752, over 21425.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.2971, pruned_loss=0.08772, over 4256401.25 frames. 
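
One pattern worth making explicit: the loss entries in this log report a combined objective together with its two components, and the reported totals are consistent with loss = 0.5 * simple_loss + pruned_loss. This is an observation about the logged numbers, not code taken from train.py; a quick check in Python against the "Epoch 3, validation" entry just above (loss=0.2766, simple_loss=0.3765, pruned_loss=0.08831):

    simple_loss, pruned_loss = 0.3765, 0.08831
    print(round(0.5 * simple_loss + pruned_loss, 4))  # -> 0.2766, the logged total

The per-batch entries satisfy the same relation, e.g. batch 19950 earlier in this section: 0.5 * 0.2787 + 0.07539 = 0.2147.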
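
Similarly, each [optim.py:471] entry lists five order statistics of recent gradient norms followed by the active clipping threshold, and throughout this section the threshold equals Clipping_scale times the middle statistic (the median): e.g. quartiles 1.757e+02 2.533e+02 3.149e+02 3.836e+02 5.627e+02 with threshold=6.299e+02 ≈ 2.0 * 3.149e+02, and threshold=6.270e+02 = 2.0 * 3.135e+02 in the following entry. Below is a minimal sketch of clipping against such a statistic, with hypothetical names and window size; it is not optim.py's actual implementation:

    import torch

    def clip_like_logged(grads, norm_history, clipping_scale=2.0, window=128):
        # Scale all gradients down when their combined norm exceeds
        # clipping_scale times the median of recently observed norms.
        total = torch.norm(torch.stack([g.norm() for g in grads]))
        norm_history.append(total.item())
        recent = torch.tensor(norm_history[-window:])
        threshold = clipping_scale * recent.median()
        if total > threshold:
            for g in grads:
                g.mul_(threshold / total)
        return total, threshold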
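
The [scaling.py:962] Whitening entries fire when a measure of how "white" a module's activations are exceeds a limit; each message reports metric=... vs. limit=... One standard way to build such a measure is shown below, purely as an assumption about what the metric could be; it may not match scaling.py's definition:

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, num_channels) activations for one group.
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]          # channel covariance
        d = cov.shape[0]
        # d * tr(C^2) / tr(C)^2 equals 1.0 exactly when all eigenvalues of C
        # are equal (covariance proportional to the identity), and grows as
        # the covariance departs from isotropy.
        return d * torch.trace(cov @ cov) / torch.trace(cov) ** 2

A regularizer built on this would penalize only the excess of the metric over the limit, which would be consistent with entries appearing solely when metric exceeds limit; that, too, is an assumption rather than something the log states.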
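
Finally, the frequent [scaling.py:182] ScheduledFloat entries record module hyper-parameters (balancer probabilities, skip rates, scale_min values) evaluated at the current batch_count, i.e. values that follow a training-time schedule. As an illustration only, assuming a piecewise-linear schedule over batch_count; the breakpoints and values below are invented, not read from the log:

    from bisect import bisect_right

    def scheduled_float(batch_count, points):
        # Piecewise-linear interpolation between (batch_count, value)
        # breakpoints, clamped to the end values outside the covered range.
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        if batch_count <= xs[0]:
            return ys[0]
        if batch_count >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, batch_count) - 1
        t = (batch_count - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])

    # Invented schedule: a skip rate annealed from 0.3 to 0.0 over 500k batches.
    print(scheduled_float(485454.0, [(0.0, 0.3), (500000.0, 0.0)]))  # ~0.0087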
], batch size: 211, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:18:05,697 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.25 vs. limit=6.0 2023-06-20 05:18:21,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493134.0, ans=0.1 2023-06-20 05:18:34,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=493194.0, ans=0.125 2023-06-20 05:18:47,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=493254.0, ans=0.2 2023-06-20 05:18:55,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=12.0 2023-06-20 05:19:07,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=493314.0, ans=0.025 2023-06-20 05:19:16,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0 2023-06-20 05:19:32,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493434.0, ans=0.1 2023-06-20 05:19:39,358 INFO [train.py:996] (0/4) Epoch 3, batch 21250, loss[loss=0.224, simple_loss=0.2929, pruned_loss=0.07756, over 21669.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.2952, pruned_loss=0.08671, over 4265658.83 frames. ], batch size: 298, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:19:40,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-20 05:21:07,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.549e+02 3.045e+02 3.638e+02 6.248e+02, threshold=6.090e+02, percent-clipped=1.0 2023-06-20 05:21:12,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=493674.0, ans=0.07 2023-06-20 05:21:15,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-20 05:21:21,845 INFO [train.py:996] (0/4) Epoch 3, batch 21300, loss[loss=0.2545, simple_loss=0.3245, pruned_loss=0.09221, over 21824.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.301, pruned_loss=0.08823, over 4257881.64 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:21:33,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=493734.0, ans=0.0 2023-06-20 05:21:35,216 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:21:48,834 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-20 05:22:03,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. 
limit=15.0 2023-06-20 05:22:04,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493794.0, ans=0.1 2023-06-20 05:22:04,960 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=22.5 2023-06-20 05:22:26,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=493854.0, ans=0.125 2023-06-20 05:23:05,084 INFO [train.py:996] (0/4) Epoch 3, batch 21350, loss[loss=0.2816, simple_loss=0.355, pruned_loss=0.1041, over 21540.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3058, pruned_loss=0.08947, over 4262061.69 frames. ], batch size: 471, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:23:05,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=494034.0, ans=0.0 2023-06-20 05:23:26,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5 2023-06-20 05:23:42,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=494094.0, ans=0.2 2023-06-20 05:24:26,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=15.0 2023-06-20 05:24:57,875 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.675e+02 2.570e+02 2.839e+02 3.402e+02 4.557e+02, threshold=5.677e+02, percent-clipped=0.0 2023-06-20 05:25:16,171 INFO [train.py:996] (0/4) Epoch 3, batch 21400, loss[loss=0.2914, simple_loss=0.3554, pruned_loss=0.1137, over 21323.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3124, pruned_loss=0.09075, over 4269688.52 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:25:27,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=494334.0, ans=0.125 2023-06-20 05:27:02,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=494574.0, ans=0.125 2023-06-20 05:27:20,262 INFO [train.py:996] (0/4) Epoch 3, batch 21450, loss[loss=0.2304, simple_loss=0.3002, pruned_loss=0.08028, over 21328.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3154, pruned_loss=0.09203, over 4277163.57 frames. 
], batch size: 159, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:27:37,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=494694.0, ans=0.0 2023-06-20 05:27:59,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=494754.0, ans=0.0 2023-06-20 05:28:18,652 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:28:29,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=494874.0, ans=0.0 2023-06-20 05:28:38,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.516e+02 2.861e+02 3.352e+02 5.412e+02, threshold=5.722e+02, percent-clipped=0.0 2023-06-20 05:28:38,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=494874.0, ans=0.0 2023-06-20 05:28:52,136 INFO [train.py:996] (0/4) Epoch 3, batch 21500, loss[loss=0.2331, simple_loss=0.2877, pruned_loss=0.08926, over 21225.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3129, pruned_loss=0.09332, over 4281854.87 frames. ], batch size: 159, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:29:38,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=495054.0, ans=0.125 2023-06-20 05:30:30,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495174.0, ans=0.1 2023-06-20 05:30:42,578 INFO [train.py:996] (0/4) Epoch 3, batch 21550, loss[loss=0.1885, simple_loss=0.2603, pruned_loss=0.05831, over 21693.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3056, pruned_loss=0.09048, over 4277932.42 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:31:13,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=495294.0, ans=0.0 2023-06-20 05:31:30,769 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-20 05:31:36,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.30 vs. limit=10.0 2023-06-20 05:32:24,792 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.424e+02 3.051e+02 3.633e+02 8.477e+02, threshold=6.101e+02, percent-clipped=5.0 2023-06-20 05:32:38,621 INFO [train.py:996] (0/4) Epoch 3, batch 21600, loss[loss=0.2396, simple_loss=0.2904, pruned_loss=0.09441, over 21578.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3025, pruned_loss=0.08853, over 4275462.20 frames. ], batch size: 415, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:32:56,957 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.24 vs. 
limit=22.5 2023-06-20 05:33:11,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=495594.0, ans=0.0 2023-06-20 05:33:19,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=495654.0, ans=0.125 2023-06-20 05:33:50,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=495714.0, ans=0.0 2023-06-20 05:34:27,997 INFO [train.py:996] (0/4) Epoch 3, batch 21650, loss[loss=0.3276, simple_loss=0.4012, pruned_loss=0.127, over 21479.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3071, pruned_loss=0.08624, over 4269454.40 frames. ], batch size: 507, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:34:57,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=495894.0, ans=0.2 2023-06-20 05:35:03,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=495954.0, ans=0.1 2023-06-20 05:35:34,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=496014.0, ans=0.125 2023-06-20 05:35:34,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-20 05:35:39,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-20 05:35:50,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.412e+02 2.832e+02 3.352e+02 7.209e+02, threshold=5.664e+02, percent-clipped=3.0 2023-06-20 05:35:58,263 INFO [train.py:996] (0/4) Epoch 3, batch 21700, loss[loss=0.2574, simple_loss=0.2981, pruned_loss=0.1084, over 20225.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3063, pruned_loss=0.08434, over 4274852.76 frames. ], batch size: 703, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:36:37,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=496254.0, ans=0.2 2023-06-20 05:37:02,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=22.5 2023-06-20 05:37:03,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=496374.0, ans=0.125 2023-06-20 05:37:33,337 INFO [train.py:996] (0/4) Epoch 3, batch 21750, loss[loss=0.2068, simple_loss=0.2657, pruned_loss=0.07393, over 21236.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.302, pruned_loss=0.08392, over 4256622.97 frames. 
], batch size: 144, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:37:45,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=496434.0, ans=0.125 2023-06-20 05:38:07,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=496554.0, ans=0.0 2023-06-20 05:38:07,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=496554.0, ans=0.0 2023-06-20 05:38:07,230 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:38:14,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=496554.0, ans=0.2 2023-06-20 05:38:22,712 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-20 05:38:23,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=496614.0, ans=0.0 2023-06-20 05:38:25,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=496614.0, ans=0.2 2023-06-20 05:38:25,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=496614.0, ans=0.0 2023-06-20 05:39:02,547 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 2.434e+02 2.778e+02 3.312e+02 5.034e+02, threshold=5.556e+02, percent-clipped=0.0 2023-06-20 05:39:10,182 INFO [train.py:996] (0/4) Epoch 3, batch 21800, loss[loss=0.2253, simple_loss=0.2801, pruned_loss=0.08524, over 21509.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.2997, pruned_loss=0.08535, over 4238668.07 frames. ], batch size: 263, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:39:53,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=496854.0, ans=0.125 2023-06-20 05:39:54,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-20 05:40:08,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=496914.0, ans=0.0 2023-06-20 05:40:15,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=496914.0, ans=0.125 2023-06-20 05:40:45,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=496974.0, ans=0.0 2023-06-20 05:40:52,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=496974.0, ans=0.125 2023-06-20 05:40:56,341 INFO [train.py:996] (0/4) Epoch 3, batch 21850, loss[loss=0.2703, simple_loss=0.3281, pruned_loss=0.1062, over 21873.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3068, pruned_loss=0.08625, over 4250289.26 frames. 
], batch size: 107, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:41:02,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=497034.0, ans=0.5 2023-06-20 05:41:47,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497154.0, ans=0.1 2023-06-20 05:41:49,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=497154.0, ans=0.125 2023-06-20 05:41:51,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-20 05:42:40,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.614e+02 3.041e+02 3.805e+02 7.327e+02, threshold=6.083e+02, percent-clipped=2.0 2023-06-20 05:42:46,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=497274.0, ans=0.2 2023-06-20 05:42:48,469 INFO [train.py:996] (0/4) Epoch 3, batch 21900, loss[loss=0.2181, simple_loss=0.2912, pruned_loss=0.07246, over 21792.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3077, pruned_loss=0.08722, over 4264592.72 frames. ], batch size: 112, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:43:06,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=497334.0, ans=0.09899494936611666 2023-06-20 05:44:27,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=15.0 2023-06-20 05:44:33,683 INFO [train.py:996] (0/4) Epoch 3, batch 21950, loss[loss=0.1697, simple_loss=0.2442, pruned_loss=0.04759, over 21493.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3008, pruned_loss=0.08479, over 4271819.79 frames. ], batch size: 212, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:44:49,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=497634.0, ans=0.025 2023-06-20 05:44:53,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-20 05:44:58,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=497694.0, ans=0.0 2023-06-20 05:46:12,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=497874.0, ans=0.125 2023-06-20 05:46:13,308 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 2.252e+02 2.673e+02 3.306e+02 5.194e+02, threshold=5.347e+02, percent-clipped=0.0 2023-06-20 05:46:21,085 INFO [train.py:996] (0/4) Epoch 3, batch 22000, loss[loss=0.2224, simple_loss=0.2933, pruned_loss=0.07575, over 21484.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2967, pruned_loss=0.08316, over 4272977.09 frames. ], batch size: 473, lr: 1.05e-02, grad_scale: 32.0 2023-06-20 05:46:33,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.83 vs. 
limit=12.0 2023-06-20 05:46:38,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=497934.0, ans=0.125 2023-06-20 05:46:43,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=497994.0, ans=0.0 2023-06-20 05:48:03,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=12.0 2023-06-20 05:48:03,700 INFO [train.py:996] (0/4) Epoch 3, batch 22050, loss[loss=0.2715, simple_loss=0.3468, pruned_loss=0.09814, over 21720.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.304, pruned_loss=0.08559, over 4277229.19 frames. ], batch size: 247, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:49:28,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=498354.0, ans=0.125 2023-06-20 05:49:43,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=498414.0, ans=0.02 2023-06-20 05:50:01,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 3.110e+02 3.918e+02 4.887e+02 9.595e+02, threshold=7.836e+02, percent-clipped=17.0 2023-06-20 05:50:01,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=498474.0, ans=0.125 2023-06-20 05:50:07,367 INFO [train.py:996] (0/4) Epoch 3, batch 22100, loss[loss=0.2436, simple_loss=0.309, pruned_loss=0.08908, over 21924.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3144, pruned_loss=0.09094, over 4263944.61 frames. ], batch size: 316, lr: 1.05e-02, grad_scale: 16.0 2023-06-20 05:50:19,086 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2023-06-20 05:50:21,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=498594.0, ans=0.1 2023-06-20 05:51:32,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-20 05:51:50,247 INFO [train.py:996] (0/4) Epoch 3, batch 22150, loss[loss=0.2397, simple_loss=0.3254, pruned_loss=0.07699, over 21475.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3173, pruned_loss=0.09303, over 4272839.99 frames. ], batch size: 211, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 05:51:54,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-20 05:52:17,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=498894.0, ans=0.125 2023-06-20 05:52:26,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.30 vs. 
limit=15.0 2023-06-20 05:52:45,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=498954.0, ans=0.0 2023-06-20 05:53:13,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=499014.0, ans=0.125 2023-06-20 05:53:15,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=499014.0, ans=0.125 2023-06-20 05:53:19,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=499074.0, ans=0.1 2023-06-20 05:53:30,271 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.905e+02 3.355e+02 4.260e+02 6.840e+02, threshold=6.709e+02, percent-clipped=0.0 2023-06-20 05:53:41,752 INFO [train.py:996] (0/4) Epoch 3, batch 22200, loss[loss=0.319, simple_loss=0.3586, pruned_loss=0.1397, over 21760.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.32, pruned_loss=0.09366, over 4279672.80 frames. ], batch size: 508, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 05:53:51,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=499134.0, ans=0.0 2023-06-20 05:53:54,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=499134.0, ans=0.125 2023-06-20 05:54:03,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=499194.0, ans=0.125 2023-06-20 05:54:25,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=499254.0, ans=0.125 2023-06-20 05:54:33,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=22.5 2023-06-20 05:54:43,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=499314.0, ans=0.125 2023-06-20 05:54:52,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=499314.0, ans=0.125 2023-06-20 05:55:27,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=499374.0, ans=0.07 2023-06-20 05:55:27,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=499374.0, ans=0.2 2023-06-20 05:55:37,074 INFO [train.py:996] (0/4) Epoch 3, batch 22250, loss[loss=0.2652, simple_loss=0.3265, pruned_loss=0.102, over 21772.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.327, pruned_loss=0.09612, over 4286701.85 frames. 
], batch size: 247, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 05:56:27,344 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:56:33,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=499614.0, ans=0.125 2023-06-20 05:56:56,408 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.809e+02 3.391e+02 3.899e+02 6.756e+02, threshold=6.782e+02, percent-clipped=1.0 2023-06-20 05:57:08,135 INFO [train.py:996] (0/4) Epoch 3, batch 22300, loss[loss=0.2724, simple_loss=0.3229, pruned_loss=0.111, over 21389.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3292, pruned_loss=0.09785, over 4289615.53 frames. ], batch size: 211, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 05:57:15,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=499734.0, ans=0.1 2023-06-20 05:57:23,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=499794.0, ans=0.0 2023-06-20 05:57:39,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=499854.0, ans=0.1 2023-06-20 05:58:16,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=499914.0, ans=0.125 2023-06-20 05:58:19,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=499914.0, ans=0.125 2023-06-20 05:58:42,885 INFO [train.py:996] (0/4) Epoch 3, batch 22350, loss[loss=0.2357, simple_loss=0.3067, pruned_loss=0.08233, over 21917.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.326, pruned_loss=0.09814, over 4296424.19 frames. ], batch size: 333, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 05:58:52,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=500034.0, ans=0.125 2023-06-20 05:59:00,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=500094.0, ans=0.125 2023-06-20 05:59:01,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=500094.0, ans=0.1 2023-06-20 05:59:23,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=500154.0, ans=0.0 2023-06-20 05:59:48,133 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.87 vs. limit=22.5 2023-06-20 06:00:13,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.473e+02 2.755e+02 3.372e+02 7.896e+02, threshold=5.510e+02, percent-clipped=3.0 2023-06-20 06:00:19,818 INFO [train.py:996] (0/4) Epoch 3, batch 22400, loss[loss=0.2321, simple_loss=0.3115, pruned_loss=0.07638, over 21629.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3224, pruned_loss=0.09415, over 4291167.33 frames. 
], batch size: 247, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:01:23,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=500514.0, ans=0.2 2023-06-20 06:02:04,300 INFO [train.py:996] (0/4) Epoch 3, batch 22450, loss[loss=0.224, simple_loss=0.2821, pruned_loss=0.08296, over 21177.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3151, pruned_loss=0.09256, over 4289470.80 frames. ], batch size: 549, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:02:31,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=500634.0, ans=0.0 2023-06-20 06:02:36,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=500694.0, ans=0.2 2023-06-20 06:03:10,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=500754.0, ans=0.125 2023-06-20 06:03:13,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=500754.0, ans=0.125 2023-06-20 06:03:50,258 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.596e+02 2.879e+02 3.313e+02 5.071e+02, threshold=5.757e+02, percent-clipped=0.0 2023-06-20 06:03:50,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=500874.0, ans=0.0 2023-06-20 06:03:56,320 INFO [train.py:996] (0/4) Epoch 3, batch 22500, loss[loss=0.2689, simple_loss=0.3092, pruned_loss=0.1143, over 20091.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3114, pruned_loss=0.09292, over 4287065.69 frames. ], batch size: 702, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:04:01,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=500934.0, ans=0.125 2023-06-20 06:04:02,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=500934.0, ans=0.0 2023-06-20 06:04:34,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=500994.0, ans=0.2 2023-06-20 06:04:44,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=501054.0, ans=0.035 2023-06-20 06:04:53,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=501054.0, ans=0.125 2023-06-20 06:05:28,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=501174.0, ans=0.125 2023-06-20 06:05:39,784 INFO [train.py:996] (0/4) Epoch 3, batch 22550, loss[loss=0.2274, simple_loss=0.2978, pruned_loss=0.0785, over 21923.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3145, pruned_loss=0.09285, over 4289723.32 frames. 
], batch size: 299, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 06:06:19,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=501294.0, ans=0.07 2023-06-20 06:06:29,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=501294.0, ans=0.04949747468305833 2023-06-20 06:06:43,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=501354.0, ans=0.0 2023-06-20 06:07:01,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=22.5 2023-06-20 06:07:09,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=501414.0, ans=0.0 2023-06-20 06:07:12,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=501414.0, ans=0.2 2023-06-20 06:07:33,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.773e+02 3.379e+02 4.240e+02 8.103e+02, threshold=6.757e+02, percent-clipped=8.0 2023-06-20 06:07:43,148 INFO [train.py:996] (0/4) Epoch 3, batch 22600, loss[loss=0.2049, simple_loss=0.2694, pruned_loss=0.07024, over 21800.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3164, pruned_loss=0.09365, over 4287130.72 frames. ], batch size: 112, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 06:07:45,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=501534.0, ans=0.125 2023-06-20 06:08:27,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=501654.0, ans=0.1 2023-06-20 06:08:46,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=501714.0, ans=0.125 2023-06-20 06:08:49,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=501714.0, ans=0.0 2023-06-20 06:08:53,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501774.0, ans=0.1 2023-06-20 06:09:25,247 INFO [train.py:996] (0/4) Epoch 3, batch 22650, loss[loss=0.2636, simple_loss=0.3081, pruned_loss=0.1095, over 21475.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3142, pruned_loss=0.09271, over 4286772.00 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 06:09:36,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501834.0, ans=0.1 2023-06-20 06:09:48,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=501894.0, ans=0.125 2023-06-20 06:11:08,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.541e+02 2.941e+02 3.391e+02 5.583e+02, threshold=5.883e+02, percent-clipped=0.0 2023-06-20 06:11:18,223 INFO [train.py:996] (0/4) Epoch 3, batch 22700, loss[loss=0.2293, simple_loss=0.2871, pruned_loss=0.08576, over 21317.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.308, pruned_loss=0.09237, over 4275591.32 frames. 
], batch size: 131, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 06:11:48,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=502194.0, ans=0.0 2023-06-20 06:12:02,479 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:12:04,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=502254.0, ans=0.125 2023-06-20 06:12:45,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=502374.0, ans=0.125 2023-06-20 06:12:48,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=502374.0, ans=0.125 2023-06-20 06:13:09,027 INFO [train.py:996] (0/4) Epoch 3, batch 22750, loss[loss=0.2613, simple_loss=0.3226, pruned_loss=0.09998, over 21200.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3084, pruned_loss=0.09374, over 4272799.65 frames. ], batch size: 143, lr: 1.04e-02, grad_scale: 16.0 2023-06-20 06:13:31,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502494.0, ans=0.1 2023-06-20 06:14:06,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=502554.0, ans=0.0 2023-06-20 06:14:21,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=502614.0, ans=0.0 2023-06-20 06:14:51,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.900e+02 3.287e+02 3.902e+02 7.614e+02, threshold=6.575e+02, percent-clipped=5.0 2023-06-20 06:15:01,371 INFO [train.py:996] (0/4) Epoch 3, batch 22800, loss[loss=0.2634, simple_loss=0.3253, pruned_loss=0.1008, over 21337.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3137, pruned_loss=0.09666, over 4271791.85 frames. ], batch size: 159, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:15:05,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-20 06:15:40,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=502854.0, ans=0.0 2023-06-20 06:15:52,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=502914.0, ans=0.0 2023-06-20 06:16:07,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=502974.0, ans=0.125 2023-06-20 06:16:32,952 INFO [train.py:996] (0/4) Epoch 3, batch 22850, loss[loss=0.2635, simple_loss=0.3157, pruned_loss=0.1057, over 21236.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3106, pruned_loss=0.09547, over 4268239.73 frames. 
], batch size: 548, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:16:34,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=503034.0, ans=0.125 2023-06-20 06:16:39,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=503034.0, ans=0.0 2023-06-20 06:16:59,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=503094.0, ans=0.125 2023-06-20 06:17:06,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=503094.0, ans=0.0 2023-06-20 06:17:12,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=503154.0, ans=0.125 2023-06-20 06:17:23,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=503154.0, ans=0.125 2023-06-20 06:17:36,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=503214.0, ans=0.0 2023-06-20 06:17:36,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=503214.0, ans=0.0 2023-06-20 06:17:37,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=503214.0, ans=0.2 2023-06-20 06:17:59,984 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.770e+02 3.456e+02 4.106e+02 7.202e+02, threshold=6.912e+02, percent-clipped=4.0 2023-06-20 06:18:10,135 INFO [train.py:996] (0/4) Epoch 3, batch 22900, loss[loss=0.1995, simple_loss=0.2589, pruned_loss=0.07002, over 21758.00 frames. ], tot_loss[loss=0.252, simple_loss=0.314, pruned_loss=0.095, over 4271139.43 frames. ], batch size: 112, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:18:10,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=22.5 2023-06-20 06:18:19,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=503334.0, ans=0.0 2023-06-20 06:18:44,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=503394.0, ans=0.1 2023-06-20 06:19:12,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=503514.0, ans=0.125 2023-06-20 06:19:43,239 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-20 06:20:01,164 INFO [train.py:996] (0/4) Epoch 3, batch 22950, loss[loss=0.2302, simple_loss=0.3258, pruned_loss=0.06733, over 21229.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3272, pruned_loss=0.0936, over 4271028.57 frames. ], batch size: 143, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:20:50,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=503754.0, ans=0.0 2023-06-20 06:20:51,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. 
limit=15.0 2023-06-20 06:21:07,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=503814.0, ans=0.95 2023-06-20 06:21:17,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=503814.0, ans=0.125 2023-06-20 06:21:51,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=503874.0, ans=0.09899494936611666 2023-06-20 06:21:53,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=503874.0, ans=0.125 2023-06-20 06:21:57,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.454e+02 2.874e+02 3.736e+02 7.174e+02, threshold=5.748e+02, percent-clipped=1.0 2023-06-20 06:22:01,686 INFO [train.py:996] (0/4) Epoch 3, batch 23000, loss[loss=0.2354, simple_loss=0.3036, pruned_loss=0.08362, over 21798.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3263, pruned_loss=0.09158, over 4275296.42 frames. ], batch size: 282, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:22:22,211 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-84000.pt 2023-06-20 06:22:27,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=15.0 2023-06-20 06:23:03,648 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-20 06:23:45,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=504174.0, ans=0.125 2023-06-20 06:23:48,139 INFO [train.py:996] (0/4) Epoch 3, batch 23050, loss[loss=0.2894, simple_loss=0.3523, pruned_loss=0.1132, over 21700.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.328, pruned_loss=0.09373, over 4281447.82 frames. ], batch size: 351, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:23:53,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=504234.0, ans=0.0 2023-06-20 06:23:53,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=504234.0, ans=0.125 2023-06-20 06:24:14,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=12.0 2023-06-20 06:24:15,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=15.0 2023-06-20 06:24:16,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=504354.0, ans=0.125 2023-06-20 06:25:17,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=22.5 2023-06-20 06:25:23,663 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.688e+02 2.976e+02 3.405e+02 5.912e+02, threshold=5.952e+02, percent-clipped=1.0 2023-06-20 06:25:28,346 INFO [train.py:996] (0/4) Epoch 3, batch 23100, loss[loss=0.217, simple_loss=0.2665, pruned_loss=0.08375, over 19981.00 frames. 
], tot_loss[loss=0.2556, simple_loss=0.3228, pruned_loss=0.09424, over 4267721.58 frames. ], batch size: 703, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:26:45,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=504714.0, ans=0.0 2023-06-20 06:27:26,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=504774.0, ans=0.0 2023-06-20 06:27:30,100 INFO [train.py:996] (0/4) Epoch 3, batch 23150, loss[loss=0.2653, simple_loss=0.3195, pruned_loss=0.1055, over 21806.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3156, pruned_loss=0.09246, over 4276826.16 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:27:38,238 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-20 06:27:41,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=504834.0, ans=0.04949747468305833 2023-06-20 06:28:12,668 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=22.5 2023-06-20 06:28:26,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=504954.0, ans=0.125 2023-06-20 06:29:20,474 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.598e+02 2.935e+02 3.793e+02 5.216e+02, threshold=5.870e+02, percent-clipped=0.0 2023-06-20 06:29:31,070 INFO [train.py:996] (0/4) Epoch 3, batch 23200, loss[loss=0.2841, simple_loss=0.3344, pruned_loss=0.1169, over 21898.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3153, pruned_loss=0.09366, over 4289950.89 frames. ], batch size: 391, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:29:38,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=505134.0, ans=0.125 2023-06-20 06:29:38,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=505134.0, ans=0.05 2023-06-20 06:29:59,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=505194.0, ans=0.125 2023-06-20 06:30:29,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.95 vs. limit=10.0 2023-06-20 06:31:16,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=505374.0, ans=0.125 2023-06-20 06:31:31,756 INFO [train.py:996] (0/4) Epoch 3, batch 23250, loss[loss=0.2827, simple_loss=0.3339, pruned_loss=0.1158, over 21514.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3168, pruned_loss=0.0953, over 4291467.88 frames. ], batch size: 548, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:31:37,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=505434.0, ans=0.125 2023-06-20 06:32:01,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=15.0 2023-06-20 06:32:26,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=505554.0, ans=0.125 2023-06-20 06:32:42,386 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.14 vs. limit=15.0 2023-06-20 06:33:25,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.860e+02 3.314e+02 4.094e+02 6.959e+02, threshold=6.628e+02, percent-clipped=4.0 2023-06-20 06:33:30,038 INFO [train.py:996] (0/4) Epoch 3, batch 23300, loss[loss=0.3186, simple_loss=0.3804, pruned_loss=0.1284, over 21711.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3253, pruned_loss=0.09706, over 4290928.70 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:35:29,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=505974.0, ans=0.0 2023-06-20 06:35:36,349 INFO [train.py:996] (0/4) Epoch 3, batch 23350, loss[loss=0.257, simple_loss=0.3158, pruned_loss=0.09908, over 19980.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3309, pruned_loss=0.09718, over 4289888.13 frames. ], batch size: 702, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:35:38,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=506034.0, ans=0.0 2023-06-20 06:36:14,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=506094.0, ans=0.125 2023-06-20 06:36:14,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=506094.0, ans=0.125 2023-06-20 06:36:51,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=506154.0, ans=0.125 2023-06-20 06:37:15,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=506214.0, ans=0.125 2023-06-20 06:37:32,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.468e+02 2.769e+02 3.183e+02 5.356e+02, threshold=5.538e+02, percent-clipped=0.0 2023-06-20 06:37:36,525 INFO [train.py:996] (0/4) Epoch 3, batch 23400, loss[loss=0.2411, simple_loss=0.3067, pruned_loss=0.08774, over 21783.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3246, pruned_loss=0.09343, over 4286295.16 frames. ], batch size: 247, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:37:39,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506334.0, ans=0.1 2023-06-20 06:38:05,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-20 06:38:29,815 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:39:07,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506514.0, ans=0.1 2023-06-20 06:39:13,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=506574.0, ans=0.0 2023-06-20 06:39:24,451 INFO [train.py:996] (0/4) Epoch 3, batch 23450, loss[loss=0.272, simple_loss=0.3283, pruned_loss=0.1079, over 21323.00 frames. 
], tot_loss[loss=0.2587, simple_loss=0.3253, pruned_loss=0.09601, over 4284656.65 frames. ], batch size: 176, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:39:25,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.12 vs. limit=15.0 2023-06-20 06:39:35,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=506634.0, ans=0.0 2023-06-20 06:39:41,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506634.0, ans=0.1 2023-06-20 06:41:16,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.915e+02 3.484e+02 4.358e+02 6.885e+02, threshold=6.969e+02, percent-clipped=11.0 2023-06-20 06:41:20,560 INFO [train.py:996] (0/4) Epoch 3, batch 23500, loss[loss=0.2813, simple_loss=0.3321, pruned_loss=0.1153, over 21250.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3251, pruned_loss=0.09778, over 4290215.95 frames. ], batch size: 143, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:41:41,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=506934.0, ans=0.0 2023-06-20 06:42:21,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=507054.0, ans=0.125 2023-06-20 06:43:16,474 INFO [train.py:996] (0/4) Epoch 3, batch 23550, loss[loss=0.2728, simple_loss=0.3149, pruned_loss=0.1153, over 21528.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3195, pruned_loss=0.09713, over 4287351.92 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:43:39,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=507234.0, ans=0.125 2023-06-20 06:44:13,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=507354.0, ans=0.1 2023-06-20 06:44:32,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=507414.0, ans=0.125 2023-06-20 06:44:52,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.559e+02 3.076e+02 4.076e+02 7.256e+02, threshold=6.152e+02, percent-clipped=1.0 2023-06-20 06:45:02,173 INFO [train.py:996] (0/4) Epoch 3, batch 23600, loss[loss=0.2652, simple_loss=0.3309, pruned_loss=0.09974, over 21327.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3186, pruned_loss=0.09645, over 4281148.41 frames. ], batch size: 159, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:46:05,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-20 06:46:06,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507654.0, ans=0.1 2023-06-20 06:47:22,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=507834.0, ans=0.125 2023-06-20 06:47:23,357 INFO [train.py:996] (0/4) Epoch 3, batch 23650, loss[loss=0.2517, simple_loss=0.3287, pruned_loss=0.08733, over 20734.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3182, pruned_loss=0.09429, over 4281454.02 frames. 
], batch size: 607, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:47:32,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=507834.0, ans=0.0 2023-06-20 06:47:38,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-20 06:47:44,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=507834.0, ans=0.125 2023-06-20 06:48:58,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=508074.0, ans=0.125 2023-06-20 06:49:14,114 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=15.0 2023-06-20 06:49:20,963 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.816e+02 3.255e+02 3.884e+02 5.582e+02, threshold=6.510e+02, percent-clipped=0.0 2023-06-20 06:49:23,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=22.5 2023-06-20 06:49:25,357 INFO [train.py:996] (0/4) Epoch 3, batch 23700, loss[loss=0.298, simple_loss=0.361, pruned_loss=0.1175, over 21736.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3213, pruned_loss=0.09349, over 4285783.83 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 32.0 2023-06-20 06:50:16,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-20 06:50:49,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=508314.0, ans=0.125 2023-06-20 06:50:51,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=508314.0, ans=0.125 2023-06-20 06:50:51,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=508314.0, ans=0.2 2023-06-20 06:51:37,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=508434.0, ans=0.125 2023-06-20 06:51:38,367 INFO [train.py:996] (0/4) Epoch 3, batch 23750, loss[loss=0.2539, simple_loss=0.3275, pruned_loss=0.09018, over 21260.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3239, pruned_loss=0.09413, over 4283613.01 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 06:52:03,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=508494.0, ans=0.2 2023-06-20 06:53:10,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=508614.0, ans=0.2 2023-06-20 06:53:28,425 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:53:41,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.562e+02 2.929e+02 3.446e+02 6.270e+02, threshold=5.857e+02, percent-clipped=0.0 2023-06-20 06:53:46,621 INFO [train.py:996] (0/4) Epoch 3, batch 23800, loss[loss=0.2797, simple_loss=0.3661, pruned_loss=0.09663, over 21806.00 frames. 
], tot_loss[loss=0.2518, simple_loss=0.3215, pruned_loss=0.09102, over 4281594.26 frames. ], batch size: 316, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 06:53:49,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-20 06:55:02,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=508854.0, ans=0.0 2023-06-20 06:55:05,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=508854.0, ans=0.1 2023-06-20 06:55:47,565 INFO [train.py:996] (0/4) Epoch 3, batch 23850, loss[loss=0.2759, simple_loss=0.3442, pruned_loss=0.1038, over 21497.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3322, pruned_loss=0.0942, over 4281730.08 frames. ], batch size: 211, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 06:55:53,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-20 06:56:08,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=509094.0, ans=0.05 2023-06-20 06:56:38,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=509154.0, ans=0.0 2023-06-20 06:57:31,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.767e+02 3.308e+02 4.321e+02 7.651e+02, threshold=6.615e+02, percent-clipped=11.0 2023-06-20 06:57:35,366 INFO [train.py:996] (0/4) Epoch 3, batch 23900, loss[loss=0.246, simple_loss=0.3119, pruned_loss=0.09001, over 21815.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3392, pruned_loss=0.09709, over 4279510.31 frames. ], batch size: 107, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 06:58:37,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=509454.0, ans=0.0 2023-06-20 06:59:11,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=509574.0, ans=0.0 2023-06-20 06:59:18,311 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:59:18,796 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=22.5 2023-06-20 06:59:19,417 INFO [train.py:996] (0/4) Epoch 3, batch 23950, loss[loss=0.2424, simple_loss=0.3021, pruned_loss=0.09135, over 21817.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3313, pruned_loss=0.09621, over 4265716.38 frames. ], batch size: 98, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 06:59:52,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0 2023-06-20 07:00:00,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.92 vs. 
limit=12.0 2023-06-20 07:00:15,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=509754.0, ans=0.95 2023-06-20 07:00:20,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=509814.0, ans=0.0 2023-06-20 07:00:37,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=509814.0, ans=0.0 2023-06-20 07:00:46,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.46 vs. limit=15.0 2023-06-20 07:00:51,004 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.718e+02 3.082e+02 3.773e+02 7.054e+02, threshold=6.164e+02, percent-clipped=1.0 2023-06-20 07:00:55,598 INFO [train.py:996] (0/4) Epoch 3, batch 24000, loss[loss=0.2904, simple_loss=0.353, pruned_loss=0.1139, over 21392.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3331, pruned_loss=0.09918, over 4263989.00 frames. ], batch size: 549, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:00:55,600 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 07:02:00,872 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2795, simple_loss=0.3782, pruned_loss=0.09043, over 1796401.00 frames. 2023-06-20 07:02:00,873 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-20 07:02:13,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=509934.0, ans=0.2 2023-06-20 07:02:24,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=509934.0, ans=0.0 2023-06-20 07:02:53,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=510054.0, ans=0.1 2023-06-20 07:04:06,968 INFO [train.py:996] (0/4) Epoch 3, batch 24050, loss[loss=0.228, simple_loss=0.3079, pruned_loss=0.07406, over 21493.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3343, pruned_loss=0.09947, over 4264359.11 frames. ], batch size: 194, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:04:49,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=510354.0, ans=0.0 2023-06-20 07:05:40,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=510414.0, ans=0.125 2023-06-20 07:05:54,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=22.5 2023-06-20 07:06:07,957 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.566e+02 3.148e+02 3.843e+02 5.773e+02, threshold=6.296e+02, percent-clipped=0.0 2023-06-20 07:06:17,656 INFO [train.py:996] (0/4) Epoch 3, batch 24100, loss[loss=0.2895, simple_loss=0.3564, pruned_loss=0.1113, over 21863.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3355, pruned_loss=0.09814, over 4271227.31 frames. ], batch size: 316, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:06:18,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.00 vs. 
limit=15.0 2023-06-20 07:07:22,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=510714.0, ans=0.1 2023-06-20 07:07:30,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=510714.0, ans=0.125 2023-06-20 07:07:41,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-20 07:08:00,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=510774.0, ans=0.125 2023-06-20 07:08:01,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=510774.0, ans=0.2 2023-06-20 07:08:05,131 INFO [train.py:996] (0/4) Epoch 3, batch 24150, loss[loss=0.238, simple_loss=0.2885, pruned_loss=0.0937, over 21192.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3353, pruned_loss=0.1006, over 4276975.75 frames. ], batch size: 608, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:08:15,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=510834.0, ans=0.025 2023-06-20 07:09:57,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511074.0, ans=0.1 2023-06-20 07:10:00,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.945e+02 3.371e+02 4.147e+02 7.109e+02, threshold=6.741e+02, percent-clipped=1.0 2023-06-20 07:10:02,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=511074.0, ans=0.125 2023-06-20 07:10:05,091 INFO [train.py:996] (0/4) Epoch 3, batch 24200, loss[loss=0.2635, simple_loss=0.3515, pruned_loss=0.08772, over 21784.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3368, pruned_loss=0.1016, over 4279712.42 frames. ], batch size: 371, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:10:18,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-06-20 07:11:23,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=511314.0, ans=0.1 2023-06-20 07:11:43,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=511314.0, ans=0.04949747468305833 2023-06-20 07:11:49,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=511374.0, ans=0.125 2023-06-20 07:12:08,273 INFO [train.py:996] (0/4) Epoch 3, batch 24250, loss[loss=0.1984, simple_loss=0.2987, pruned_loss=0.04903, over 21646.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.332, pruned_loss=0.09431, over 4274487.63 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:12:08,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=511434.0, ans=0.125 2023-06-20 07:12:33,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=511494.0, ans=0.125 2023-06-20 07:12:34,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.72 vs. 
limit=15.0 2023-06-20 07:13:02,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=511554.0, ans=0.125 2023-06-20 07:13:51,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 2.243e+02 2.741e+02 3.193e+02 5.760e+02, threshold=5.481e+02, percent-clipped=0.0 2023-06-20 07:14:00,780 INFO [train.py:996] (0/4) Epoch 3, batch 24300, loss[loss=0.1917, simple_loss=0.2618, pruned_loss=0.06084, over 21454.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3223, pruned_loss=0.08687, over 4269791.37 frames. ], batch size: 194, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:14:59,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.91 vs. limit=22.5 2023-06-20 07:15:26,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=511914.0, ans=0.04949747468305833 2023-06-20 07:16:04,844 INFO [train.py:996] (0/4) Epoch 3, batch 24350, loss[loss=0.2496, simple_loss=0.3103, pruned_loss=0.09442, over 21674.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3199, pruned_loss=0.08764, over 4273081.30 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:16:26,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=512034.0, ans=0.125 2023-06-20 07:17:08,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=512154.0, ans=0.2 2023-06-20 07:17:30,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=512214.0, ans=0.2 2023-06-20 07:18:15,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.742e+02 3.210e+02 4.189e+02 6.509e+02, threshold=6.419e+02, percent-clipped=5.0 2023-06-20 07:18:20,251 INFO [train.py:996] (0/4) Epoch 3, batch 24400, loss[loss=0.2977, simple_loss=0.3498, pruned_loss=0.1228, over 21355.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3253, pruned_loss=0.09176, over 4272602.03 frames. ], batch size: 471, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:18:20,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=512334.0, ans=0.125 2023-06-20 07:18:22,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.75 vs. limit=6.0 2023-06-20 07:18:37,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=512334.0, ans=0.125 2023-06-20 07:20:11,826 INFO [train.py:996] (0/4) Epoch 3, batch 24450, loss[loss=0.2454, simple_loss=0.337, pruned_loss=0.07689, over 21736.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3285, pruned_loss=0.09339, over 4272858.97 frames. ], batch size: 332, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:20:14,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.87 vs. 
limit=22.5 2023-06-20 07:21:44,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=512814.0, ans=0.2 2023-06-20 07:22:18,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 2.557e+02 2.863e+02 3.394e+02 4.374e+02, threshold=5.725e+02, percent-clipped=0.0 2023-06-20 07:22:28,443 INFO [train.py:996] (0/4) Epoch 3, batch 24500, loss[loss=0.3203, simple_loss=0.3588, pruned_loss=0.1409, over 21762.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3269, pruned_loss=0.09274, over 4277894.95 frames. ], batch size: 508, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:22:45,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=512994.0, ans=0.125 2023-06-20 07:23:12,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=513054.0, ans=0.2 2023-06-20 07:23:12,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=513054.0, ans=0.125 2023-06-20 07:23:43,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-20 07:24:09,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2023-06-20 07:24:10,063 INFO [train.py:996] (0/4) Epoch 3, batch 24550, loss[loss=0.2892, simple_loss=0.3522, pruned_loss=0.113, over 21228.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3293, pruned_loss=0.09542, over 4281147.16 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:24:15,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=513234.0, ans=0.125 2023-06-20 07:24:43,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=513294.0, ans=0.125 2023-06-20 07:25:14,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=513354.0, ans=0.0 2023-06-20 07:25:14,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=513354.0, ans=0.0 2023-06-20 07:25:17,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513354.0, ans=0.1 2023-06-20 07:25:57,166 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.731e+02 3.290e+02 3.893e+02 7.449e+02, threshold=6.579e+02, percent-clipped=2.0 2023-06-20 07:26:00,007 INFO [train.py:996] (0/4) Epoch 3, batch 24600, loss[loss=0.234, simple_loss=0.2977, pruned_loss=0.08517, over 21582.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3251, pruned_loss=0.09586, over 4275823.17 frames. 
], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:27:28,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=513714.0, ans=0.1 2023-06-20 07:27:33,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=513714.0, ans=0.0 2023-06-20 07:27:59,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=513774.0, ans=0.0 2023-06-20 07:28:06,818 INFO [train.py:996] (0/4) Epoch 3, batch 24650, loss[loss=0.2453, simple_loss=0.3015, pruned_loss=0.0945, over 21763.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3173, pruned_loss=0.09438, over 4275811.16 frames. ], batch size: 118, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:28:13,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513834.0, ans=0.1 2023-06-20 07:29:13,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=513954.0, ans=0.125 2023-06-20 07:29:24,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=514014.0, ans=0.125 2023-06-20 07:29:55,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.706e+02 3.131e+02 3.721e+02 9.290e+02, threshold=6.262e+02, percent-clipped=1.0 2023-06-20 07:29:58,418 INFO [train.py:996] (0/4) Epoch 3, batch 24700, loss[loss=0.224, simple_loss=0.2874, pruned_loss=0.08026, over 21204.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3163, pruned_loss=0.09219, over 4272033.41 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:30:22,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=514134.0, ans=0.035 2023-06-20 07:30:23,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=514134.0, ans=0.2 2023-06-20 07:30:35,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=514194.0, ans=0.0 2023-06-20 07:30:53,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=514254.0, ans=0.125 2023-06-20 07:31:29,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=514314.0, ans=0.125 2023-06-20 07:31:45,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=514374.0, ans=0.125 2023-06-20 07:31:46,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=514374.0, ans=0.2 2023-06-20 07:31:53,323 INFO [train.py:996] (0/4) Epoch 3, batch 24750, loss[loss=0.225, simple_loss=0.2833, pruned_loss=0.08337, over 21618.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3102, pruned_loss=0.08911, over 4267922.59 frames. 
], batch size: 415, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:32:11,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=514434.0, ans=0.0 2023-06-20 07:32:35,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=514494.0, ans=0.2 2023-06-20 07:33:08,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=514554.0, ans=0.0 2023-06-20 07:33:13,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=514614.0, ans=0.2 2023-06-20 07:33:19,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-20 07:33:26,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=514674.0, ans=0.0 2023-06-20 07:33:46,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=514674.0, ans=0.125 2023-06-20 07:33:52,501 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.345e+02 2.611e+02 2.913e+02 4.887e+02, threshold=5.223e+02, percent-clipped=0.0 2023-06-20 07:34:06,597 INFO [train.py:996] (0/4) Epoch 3, batch 24800, loss[loss=0.2123, simple_loss=0.2552, pruned_loss=0.08473, over 21025.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3038, pruned_loss=0.08861, over 4273068.22 frames. ], batch size: 608, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:35:05,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=514854.0, ans=0.2 2023-06-20 07:35:05,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=514854.0, ans=0.125 2023-06-20 07:35:20,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-20 07:35:52,707 INFO [train.py:996] (0/4) Epoch 3, batch 24850, loss[loss=0.2153, simple_loss=0.269, pruned_loss=0.08081, over 21314.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3064, pruned_loss=0.09087, over 4278375.07 frames. 
], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:35:54,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=515034.0, ans=0.05 2023-06-20 07:37:15,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=515214.0, ans=0.125 2023-06-20 07:37:28,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=515274.0, ans=0.125 2023-06-20 07:37:40,962 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:37:46,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.912e+02 3.452e+02 3.888e+02 6.528e+02, threshold=6.903e+02, percent-clipped=3.0 2023-06-20 07:37:46,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=515274.0, ans=0.125 2023-06-20 07:37:49,496 INFO [train.py:996] (0/4) Epoch 3, batch 24900, loss[loss=0.2963, simple_loss=0.3582, pruned_loss=0.1172, over 21483.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3102, pruned_loss=0.09232, over 4279404.82 frames. ], batch size: 194, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:38:15,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=515394.0, ans=0.0 2023-06-20 07:39:16,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=515574.0, ans=0.125 2023-06-20 07:39:39,335 INFO [train.py:996] (0/4) Epoch 3, batch 24950, loss[loss=0.2916, simple_loss=0.3542, pruned_loss=0.1146, over 20643.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3196, pruned_loss=0.0976, over 4281021.41 frames. ], batch size: 607, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:39:47,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=515634.0, ans=0.2 2023-06-20 07:39:57,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=515634.0, ans=0.1 2023-06-20 07:40:19,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=515754.0, ans=0.0 2023-06-20 07:40:45,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=515814.0, ans=0.1 2023-06-20 07:41:33,067 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.916e+02 3.619e+02 4.634e+02 7.027e+02, threshold=7.237e+02, percent-clipped=1.0 2023-06-20 07:41:36,089 INFO [train.py:996] (0/4) Epoch 3, batch 25000, loss[loss=0.2329, simple_loss=0.3019, pruned_loss=0.08193, over 21279.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3269, pruned_loss=0.09978, over 4278487.41 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:41:39,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=515934.0, ans=0.125 2023-06-20 07:41:46,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=515934.0, ans=0.125 2023-06-20 07:41:52,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. 
limit=15.0 2023-06-20 07:42:01,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=515994.0, ans=0.0 2023-06-20 07:42:49,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=516114.0, ans=0.1 2023-06-20 07:42:53,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.91 vs. limit=12.0 2023-06-20 07:43:05,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=516114.0, ans=0.0 2023-06-20 07:43:06,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2023-06-20 07:43:15,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=516174.0, ans=0.05 2023-06-20 07:43:28,176 INFO [train.py:996] (0/4) Epoch 3, batch 25050, loss[loss=0.2134, simple_loss=0.2695, pruned_loss=0.07864, over 21665.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3187, pruned_loss=0.09772, over 4267034.00 frames. ], batch size: 248, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:44:32,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=516354.0, ans=0.125 2023-06-20 07:44:52,833 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-20 07:45:30,414 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.526e+02 2.799e+02 3.395e+02 4.701e+02, threshold=5.598e+02, percent-clipped=0.0 2023-06-20 07:45:33,105 INFO [train.py:996] (0/4) Epoch 3, batch 25100, loss[loss=0.2472, simple_loss=0.2884, pruned_loss=0.103, over 21331.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3127, pruned_loss=0.09617, over 4263875.52 frames. ], batch size: 473, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:45:36,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=516534.0, ans=0.5 2023-06-20 07:45:52,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=516594.0, ans=0.125 2023-06-20 07:46:04,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=516594.0, ans=0.0 2023-06-20 07:47:26,339 INFO [train.py:996] (0/4) Epoch 3, batch 25150, loss[loss=0.3003, simple_loss=0.3775, pruned_loss=0.1116, over 21437.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3175, pruned_loss=0.09386, over 4271161.36 frames. ], batch size: 471, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:47:27,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-20 07:47:53,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=516894.0, ans=0.125 2023-06-20 07:48:09,192 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.59 vs. 
limit=22.5 2023-06-20 07:48:14,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=516954.0, ans=0.125 2023-06-20 07:48:57,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=517074.0, ans=0.1 2023-06-20 07:49:12,221 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.389e+02 2.624e+02 3.346e+02 4.774e+02, threshold=5.249e+02, percent-clipped=0.0 2023-06-20 07:49:15,105 INFO [train.py:996] (0/4) Epoch 3, batch 25200, loss[loss=0.2192, simple_loss=0.305, pruned_loss=0.06672, over 21697.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3167, pruned_loss=0.09067, over 4272553.56 frames. ], batch size: 298, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:49:15,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=517134.0, ans=0.125 2023-06-20 07:49:34,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=517134.0, ans=0.125 2023-06-20 07:49:54,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=517254.0, ans=0.02 2023-06-20 07:49:56,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-20 07:50:50,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=517374.0, ans=12.0 2023-06-20 07:51:12,212 INFO [train.py:996] (0/4) Epoch 3, batch 25250, loss[loss=0.2186, simple_loss=0.2971, pruned_loss=0.07007, over 16713.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3131, pruned_loss=0.08834, over 4260376.00 frames. ], batch size: 62, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:51:25,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.11 vs. limit=15.0 2023-06-20 07:51:42,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=517494.0, ans=0.125 2023-06-20 07:53:09,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.525e+02 2.860e+02 3.493e+02 8.717e+02, threshold=5.720e+02, percent-clipped=4.0 2023-06-20 07:53:12,415 INFO [train.py:996] (0/4) Epoch 3, batch 25300, loss[loss=0.2868, simple_loss=0.3613, pruned_loss=0.1061, over 21445.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3108, pruned_loss=0.0881, over 4253034.59 frames. ], batch size: 131, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:53:30,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=517734.0, ans=0.125 2023-06-20 07:53:36,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=517794.0, ans=0.2 2023-06-20 07:54:59,681 INFO [train.py:996] (0/4) Epoch 3, batch 25350, loss[loss=0.1895, simple_loss=0.2664, pruned_loss=0.05634, over 21381.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3121, pruned_loss=0.08699, over 4243257.04 frames. 
], batch size: 194, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:55:53,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2023-06-20 07:56:20,384 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:56:49,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-20 07:56:51,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.620e+02 3.050e+02 3.855e+02 6.289e+02, threshold=6.099e+02, percent-clipped=1.0 2023-06-20 07:56:53,737 INFO [train.py:996] (0/4) Epoch 3, batch 25400, loss[loss=0.2195, simple_loss=0.2736, pruned_loss=0.08274, over 21473.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3074, pruned_loss=0.08613, over 4246173.72 frames. ], batch size: 212, lr: 1.03e-02, grad_scale: 32.0 2023-06-20 07:57:00,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=518334.0, ans=0.0 2023-06-20 07:57:45,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=518454.0, ans=15.0 2023-06-20 07:57:56,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=518514.0, ans=0.125 2023-06-20 07:58:01,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=518514.0, ans=10.0 2023-06-20 07:58:24,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=12.0 2023-06-20 07:58:31,275 INFO [train.py:996] (0/4) Epoch 3, batch 25450, loss[loss=0.2407, simple_loss=0.3117, pruned_loss=0.08486, over 21810.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3085, pruned_loss=0.08801, over 4255854.64 frames. ], batch size: 118, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 07:59:07,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=518754.0, ans=0.09899494936611666 2023-06-20 08:00:13,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.261e+02 2.543e+02 3.250e+02 4.751e+02, threshold=5.087e+02, percent-clipped=0.0 2023-06-20 08:00:16,652 INFO [train.py:996] (0/4) Epoch 3, batch 25500, loss[loss=0.2407, simple_loss=0.3182, pruned_loss=0.08155, over 21260.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3091, pruned_loss=0.08354, over 4258164.68 frames. 
], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:00:43,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=518934.0, ans=0.125 2023-06-20 08:01:11,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=519054.0, ans=0.125 2023-06-20 08:01:35,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=519114.0, ans=0.0 2023-06-20 08:01:37,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519114.0, ans=0.1 2023-06-20 08:01:43,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=519114.0, ans=0.0 2023-06-20 08:01:53,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=519114.0, ans=0.2 2023-06-20 08:02:39,009 INFO [train.py:996] (0/4) Epoch 3, batch 25550, loss[loss=0.2571, simple_loss=0.321, pruned_loss=0.09659, over 19997.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3155, pruned_loss=0.08417, over 4248987.04 frames. ], batch size: 702, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:02:48,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=519234.0, ans=6.0 2023-06-20 08:04:23,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=519474.0, ans=0.125 2023-06-20 08:04:40,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=519474.0, ans=10.0 2023-06-20 08:04:40,731 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.576e+02 2.895e+02 3.439e+02 5.948e+02, threshold=5.790e+02, percent-clipped=2.0 2023-06-20 08:04:43,682 INFO [train.py:996] (0/4) Epoch 3, batch 25600, loss[loss=0.2703, simple_loss=0.3334, pruned_loss=0.1036, over 20777.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3191, pruned_loss=0.08525, over 4255332.28 frames. ], batch size: 608, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:05:06,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=519534.0, ans=0.1 2023-06-20 08:05:20,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=519594.0, ans=0.2 2023-06-20 08:05:23,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-06-20 08:05:47,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=519654.0, ans=0.0 2023-06-20 08:05:56,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=519654.0, ans=0.0 2023-06-20 08:06:33,953 INFO [train.py:996] (0/4) Epoch 3, batch 25650, loss[loss=0.2695, simple_loss=0.3116, pruned_loss=0.1137, over 21238.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3207, pruned_loss=0.08865, over 4243417.21 frames. 
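
The Whitening entries compare a per-module "metric" against a (scheduled) limit. One whiteness proxy consistent with the logged values is D * trace(C^2) / trace(C)^2 for the feature covariance C over D channels: it is 1.0 for perfectly white features and approaches D when the covariance collapses toward rank one, which matches metrics ranging from ~2 up to ~250 for num_channels=256. A sketch under that assumption (scaling.py may compute it differently):

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        # x: (num_frames, num_channels) activations for one module
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]                 # (D, D) covariance
        d = cov.shape[0]
        metric = d * (cov @ cov).diagonal().sum() / cov.diagonal().sum() ** 2
        return metric.item()

    white = torch.randn(10000, 256)
    collapsed = torch.randn(10000, 1).expand(10000, 256) + 0.01 * torch.randn(10000, 256)
    print(whitening_metric(white))      # ~1.0: nothing for the limit to flag
    print(whitening_metric(collapsed))  # approaches 256, far above a limit of 15.0
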
], batch size: 471, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:07:06,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=519894.0, ans=0.125 2023-06-20 08:07:41,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.02 vs. limit=15.0 2023-06-20 08:07:41,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=520014.0, ans=0.0 2023-06-20 08:07:58,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.29 vs. limit=15.0 2023-06-20 08:08:07,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.861e+02 3.374e+02 3.830e+02 5.312e+02, threshold=6.747e+02, percent-clipped=0.0 2023-06-20 08:08:10,529 INFO [train.py:996] (0/4) Epoch 3, batch 25700, loss[loss=0.2537, simple_loss=0.3215, pruned_loss=0.09296, over 21371.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3175, pruned_loss=0.08972, over 4257776.25 frames. ], batch size: 131, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:08:29,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=520194.0, ans=0.0 2023-06-20 08:08:43,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=520194.0, ans=0.125 2023-06-20 08:08:45,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=520254.0, ans=0.1 2023-06-20 08:10:17,144 INFO [train.py:996] (0/4) Epoch 3, batch 25750, loss[loss=0.3786, simple_loss=0.4459, pruned_loss=0.1556, over 21739.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3249, pruned_loss=0.0934, over 4255839.82 frames. ], batch size: 441, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:10:39,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=520434.0, ans=0.2 2023-06-20 08:11:15,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=520494.0, ans=0.1 2023-06-20 08:11:24,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=520554.0, ans=0.0 2023-06-20 08:11:45,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=520614.0, ans=0.09899494936611666 2023-06-20 08:11:46,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=520614.0, ans=0.0 2023-06-20 08:12:03,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=520674.0, ans=0.07 2023-06-20 08:12:19,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.970e+02 3.422e+02 4.154e+02 6.514e+02, threshold=6.844e+02, percent-clipped=0.0 2023-06-20 08:12:22,612 INFO [train.py:996] (0/4) Epoch 3, batch 25800, loss[loss=0.3068, simple_loss=0.3759, pruned_loss=0.1189, over 21445.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3387, pruned_loss=0.09823, over 4258644.15 frames. 
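
Each train.py:996 entry pairs the current batch's loss (over that batch's frames) with a tot_loss over roughly 4.2M frames, which behaves like a frame-weighted running average of recent batches. A sketch of one such aggregation; the forgetting factor is an assumption, not read from train.py:

    class RunningLoss:
        """Frame-weighted running average with exponential forgetting."""
        def __init__(self, decay: float = 0.999):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> float:
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames
            # reported as tot_loss[loss=..., over <frames> frames]
            return self.loss_sum / self.frames

    tot = RunningLoss()
    for loss, frames in [(0.27, 21746.0), (0.26, 19997.0), (0.25, 20777.0)]:
        print(tot.update(loss, frames))
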
], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:12:34,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=520734.0, ans=0.025 2023-06-20 08:12:46,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-20 08:13:02,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-20 08:13:29,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=520854.0, ans=0.0 2023-06-20 08:14:00,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=520914.0, ans=0.04949747468305833 2023-06-20 08:14:16,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=520974.0, ans=0.0 2023-06-20 08:14:42,708 INFO [train.py:996] (0/4) Epoch 3, batch 25850, loss[loss=0.2754, simple_loss=0.3422, pruned_loss=0.1043, over 21746.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3396, pruned_loss=0.098, over 4262227.52 frames. ], batch size: 389, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:15:55,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=521214.0, ans=0.015 2023-06-20 08:16:01,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-20 08:16:12,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=521274.0, ans=0.025 2023-06-20 08:16:38,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=521274.0, ans=0.2 2023-06-20 08:16:42,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.635e+02 3.168e+02 4.552e+02 6.616e+02, threshold=6.336e+02, percent-clipped=0.0 2023-06-20 08:16:45,274 INFO [train.py:996] (0/4) Epoch 3, batch 25900, loss[loss=0.3342, simple_loss=0.4114, pruned_loss=0.1284, over 21700.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3419, pruned_loss=0.09935, over 4262532.56 frames. ], batch size: 414, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:16:46,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-20 08:17:16,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=521394.0, ans=0.125 2023-06-20 08:17:59,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=521454.0, ans=0.0 2023-06-20 08:18:53,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=521634.0, ans=0.2 2023-06-20 08:18:54,003 INFO [train.py:996] (0/4) Epoch 3, batch 25950, loss[loss=0.234, simple_loss=0.3001, pruned_loss=0.08394, over 21196.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3457, pruned_loss=0.1015, over 4267032.02 frames. 
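
The grad_scale: 32.0 field on every training entry is the dynamic loss-scaling factor for mixed-precision training; it shrinks on overflow and can grow back, which is the standard torch.cuda.amp behavior. A generic sketch of that loop (the recipe's train.py wires this up with more bookkeeping):

    import torch

    model = torch.nn.Linear(80, 512).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()     # maintains the dynamic grad_scale

    for step in range(3):
        x = torch.randn(8, 80, device="cuda")
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():      # fp16 forward
            loss = model(x).pow(2).mean()
        scaler.scale(loss).backward()        # scaled, fp16-safe backward
        scaler.step(optimizer)               # unscales; skips step on inf/nan
        scaler.update()                      # adjusts the scale factor
        print(step, scaler.get_scale())      # the value logged as grad_scale
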
], batch size: 607, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:20:02,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=521814.0, ans=0.125 2023-06-20 08:20:07,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.84 vs. limit=22.5 2023-06-20 08:20:46,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.642e+02 3.151e+02 3.673e+02 6.319e+02, threshold=6.303e+02, percent-clipped=0.0 2023-06-20 08:20:49,288 INFO [train.py:996] (0/4) Epoch 3, batch 26000, loss[loss=0.289, simple_loss=0.349, pruned_loss=0.1145, over 20695.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3465, pruned_loss=0.1001, over 4266629.06 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:20:55,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=521934.0, ans=0.125 2023-06-20 08:21:03,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2023-06-20 08:21:25,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=521994.0, ans=0.0 2023-06-20 08:21:31,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=521994.0, ans=0.125 2023-06-20 08:22:01,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.02 vs. limit=12.0 2023-06-20 08:22:10,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.71 vs. limit=10.0 2023-06-20 08:22:35,218 INFO [train.py:996] (0/4) Epoch 3, batch 26050, loss[loss=0.2354, simple_loss=0.2993, pruned_loss=0.08578, over 21949.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3459, pruned_loss=0.09968, over 4265671.33 frames. ], batch size: 283, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:23:11,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=522294.0, ans=0.125 2023-06-20 08:23:50,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=522354.0, ans=0.1 2023-06-20 08:24:11,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-20 08:24:20,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=12.0 2023-06-20 08:24:35,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=522474.0, ans=0.1 2023-06-20 08:24:39,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. 
limit=15.0 2023-06-20 08:24:39,854 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.804e+02 3.203e+02 3.918e+02 6.790e+02, threshold=6.407e+02, percent-clipped=4.0 2023-06-20 08:24:42,771 INFO [train.py:996] (0/4) Epoch 3, batch 26100, loss[loss=0.2473, simple_loss=0.308, pruned_loss=0.09331, over 21883.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3403, pruned_loss=0.09997, over 4266674.67 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:24:54,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=522534.0, ans=0.1 2023-06-20 08:25:31,114 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.68 vs. limit=10.0 2023-06-20 08:25:35,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=522654.0, ans=0.125 2023-06-20 08:25:48,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=522654.0, ans=0.125 2023-06-20 08:26:02,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-20 08:26:18,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=522714.0, ans=0.1 2023-06-20 08:26:38,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=522774.0, ans=0.125 2023-06-20 08:26:46,812 INFO [train.py:996] (0/4) Epoch 3, batch 26150, loss[loss=0.2709, simple_loss=0.3252, pruned_loss=0.1083, over 19996.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3355, pruned_loss=0.1001, over 4269848.04 frames. ], batch size: 702, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:27:01,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=522834.0, ans=0.0 2023-06-20 08:27:27,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.42 vs. 
limit=15.0 2023-06-20 08:27:42,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=522954.0, ans=0.125 2023-06-20 08:28:03,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523014.0, ans=0.1 2023-06-20 08:28:12,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=523014.0, ans=0.0 2023-06-20 08:28:14,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=523014.0, ans=0.125 2023-06-20 08:28:25,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=523014.0, ans=0.0 2023-06-20 08:28:38,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=523074.0, ans=0.125 2023-06-20 08:28:39,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.726e+02 3.009e+02 3.723e+02 5.538e+02, threshold=6.017e+02, percent-clipped=0.0 2023-06-20 08:28:49,608 INFO [train.py:996] (0/4) Epoch 3, batch 26200, loss[loss=0.2547, simple_loss=0.3033, pruned_loss=0.1031, over 20032.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.336, pruned_loss=0.09811, over 4274132.01 frames. ], batch size: 703, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:29:04,430 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.58 vs. limit=6.0 2023-06-20 08:29:06,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=523134.0, ans=0.0 2023-06-20 08:29:21,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=523194.0, ans=0.2 2023-06-20 08:30:15,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=523254.0, ans=0.0 2023-06-20 08:30:18,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-20 08:30:31,586 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-20 08:30:59,893 INFO [train.py:996] (0/4) Epoch 3, batch 26250, loss[loss=0.2743, simple_loss=0.3378, pruned_loss=0.1054, over 21553.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3402, pruned_loss=0.09739, over 4275488.98 frames. ], batch size: 548, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:31:32,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. 
limit=15.0 2023-06-20 08:31:48,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=523494.0, ans=0.125 2023-06-20 08:31:49,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=523494.0, ans=0.125 2023-06-20 08:32:28,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=523614.0, ans=0.0 2023-06-20 08:33:04,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.312e+02 2.693e+02 3.337e+02 4.034e+02 6.745e+02, threshold=6.673e+02, percent-clipped=1.0 2023-06-20 08:33:07,770 INFO [train.py:996] (0/4) Epoch 3, batch 26300, loss[loss=0.2951, simple_loss=0.3485, pruned_loss=0.1209, over 21883.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3364, pruned_loss=0.09799, over 4282342.92 frames. ], batch size: 124, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:33:08,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=523734.0, ans=0.125 2023-06-20 08:33:54,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-20 08:34:04,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=523854.0, ans=0.125 2023-06-20 08:34:24,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=523914.0, ans=0.0 2023-06-20 08:35:08,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=22.5 2023-06-20 08:35:13,488 INFO [train.py:996] (0/4) Epoch 3, batch 26350, loss[loss=0.2804, simple_loss=0.3424, pruned_loss=0.1092, over 21875.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3353, pruned_loss=0.09897, over 4284502.91 frames. ], batch size: 316, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:35:39,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=524034.0, ans=0.0 2023-06-20 08:37:02,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.700e+02 3.038e+02 3.604e+02 6.055e+02, threshold=6.077e+02, percent-clipped=0.0 2023-06-20 08:37:04,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=524334.0, ans=0.125 2023-06-20 08:37:05,307 INFO [train.py:996] (0/4) Epoch 3, batch 26400, loss[loss=0.2324, simple_loss=0.2843, pruned_loss=0.0903, over 21263.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.33, pruned_loss=0.09968, over 4276926.04 frames. ], batch size: 160, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:38:11,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-20 08:39:02,891 INFO [train.py:996] (0/4) Epoch 3, batch 26450, loss[loss=0.3042, simple_loss=0.3968, pruned_loss=0.1058, over 21727.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3291, pruned_loss=0.09864, over 4271791.96 frames. 
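
Across this section the headline loss is exactly 0.5 * simple_loss + pruned_loss (e.g. for batch 26450 above, 0.5 * 0.3968 + 0.1058 = 0.3042). A toy illustration of that weighting; the 0.5 scale is inferred from the logged numbers, not read from the code, and any warmup on it is omitted:

    import torch

    def combine_losses(simple_loss: torch.Tensor,
                       pruned_loss: torch.Tensor,
                       simple_scale: float = 0.5) -> torch.Tensor:
        # Full weight on the pruned transducer loss, plus a damped contribution
        # from the simple loss that learns the pruning bounds.
        return simple_scale * simple_loss + pruned_loss

    simple = torch.tensor(0.3968)
    pruned = torch.tensor(0.1058)
    print(combine_losses(simple, pruned))  # -> 0.3042, the batch 26450 loss above
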
], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:39:09,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=524634.0, ans=0.0 2023-06-20 08:39:47,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=524754.0, ans=0.125 2023-06-20 08:40:25,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524814.0, ans=0.1 2023-06-20 08:41:03,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.56 vs. limit=22.5 2023-06-20 08:41:08,731 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.874e+02 3.456e+02 4.321e+02 8.810e+02, threshold=6.911e+02, percent-clipped=7.0 2023-06-20 08:41:11,682 INFO [train.py:996] (0/4) Epoch 3, batch 26500, loss[loss=0.2363, simple_loss=0.3024, pruned_loss=0.08512, over 21631.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3316, pruned_loss=0.09758, over 4275710.22 frames. ], batch size: 230, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:41:26,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=524934.0, ans=0.2 2023-06-20 08:42:30,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=525114.0, ans=0.0 2023-06-20 08:43:21,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=525174.0, ans=0.07 2023-06-20 08:43:27,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=525234.0, ans=0.125 2023-06-20 08:43:28,417 INFO [train.py:996] (0/4) Epoch 3, batch 26550, loss[loss=0.2744, simple_loss=0.3712, pruned_loss=0.08877, over 19796.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3288, pruned_loss=0.09456, over 4263909.19 frames. ], batch size: 703, lr: 1.02e-02, grad_scale: 64.0 2023-06-20 08:44:18,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=525294.0, ans=0.1 2023-06-20 08:44:45,091 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.61 vs. limit=12.0 2023-06-20 08:44:54,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=525414.0, ans=0.125 2023-06-20 08:45:29,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=525474.0, ans=0.0 2023-06-20 08:45:39,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.519e+02 3.120e+02 3.991e+02 8.354e+02, threshold=6.239e+02, percent-clipped=2.0 2023-06-20 08:45:39,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=525534.0, ans=0.125 2023-06-20 08:45:40,624 INFO [train.py:996] (0/4) Epoch 3, batch 26600, loss[loss=0.2704, simple_loss=0.3183, pruned_loss=0.1113, over 19983.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.327, pruned_loss=0.09137, over 4264587.92 frames. 
], batch size: 703, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:46:30,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=525654.0, ans=0.125 2023-06-20 08:47:41,372 INFO [train.py:996] (0/4) Epoch 3, batch 26650, loss[loss=0.2428, simple_loss=0.3045, pruned_loss=0.09053, over 21889.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3198, pruned_loss=0.08997, over 4264818.27 frames. ], batch size: 107, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:48:26,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=525954.0, ans=0.125 2023-06-20 08:48:36,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=526014.0, ans=0.0 2023-06-20 08:48:45,810 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:49:16,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.669e+02 2.277e+02 2.544e+02 2.972e+02 4.413e+02, threshold=5.088e+02, percent-clipped=0.0 2023-06-20 08:49:23,175 INFO [train.py:996] (0/4) Epoch 3, batch 26700, loss[loss=0.3053, simple_loss=0.3436, pruned_loss=0.1335, over 21784.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3131, pruned_loss=0.08698, over 4272266.45 frames. ], batch size: 508, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:50:06,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=526194.0, ans=0.125 2023-06-20 08:51:36,259 INFO [train.py:996] (0/4) Epoch 3, batch 26750, loss[loss=0.2519, simple_loss=0.3305, pruned_loss=0.08662, over 20718.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3121, pruned_loss=0.08571, over 4274887.37 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:52:09,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=526494.0, ans=0.125 2023-06-20 08:52:29,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=526554.0, ans=0.125 2023-06-20 08:53:01,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=526614.0, ans=0.125 2023-06-20 08:53:09,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=526614.0, ans=0.0 2023-06-20 08:53:09,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=526614.0, ans=0.0 2023-06-20 08:53:15,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=526674.0, ans=0.125 2023-06-20 08:53:18,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=526674.0, ans=0.125 2023-06-20 08:53:55,875 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.827e+02 3.440e+02 3.863e+02 5.872e+02, threshold=6.879e+02, percent-clipped=7.0 2023-06-20 08:54:02,659 INFO [train.py:996] (0/4) Epoch 3, batch 26800, loss[loss=0.2969, simple_loss=0.3648, pruned_loss=0.1145, over 21226.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3213, pruned_loss=0.0908, over 4276543.34 frames. 
], batch size: 143, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:54:13,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=526734.0, ans=0.1 2023-06-20 08:54:20,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=526794.0, ans=0.2 2023-06-20 08:54:54,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=526854.0, ans=0.125 2023-06-20 08:55:17,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=526914.0, ans=0.07 2023-06-20 08:55:17,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=526914.0, ans=0.125 2023-06-20 08:55:28,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=526974.0, ans=0.2 2023-06-20 08:55:44,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.20 vs. limit=10.0 2023-06-20 08:55:44,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=526974.0, ans=0.125 2023-06-20 08:55:54,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=526974.0, ans=0.0 2023-06-20 08:56:05,474 INFO [train.py:996] (0/4) Epoch 3, batch 26850, loss[loss=0.2348, simple_loss=0.3014, pruned_loss=0.08405, over 21792.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3236, pruned_loss=0.09401, over 4263115.69 frames. ], batch size: 118, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:56:06,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5 2023-06-20 08:56:30,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=527094.0, ans=10.0 2023-06-20 08:56:33,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=527094.0, ans=0.2 2023-06-20 08:56:39,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=527154.0, ans=0.125 2023-06-20 08:57:39,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.442e+02 3.000e+02 3.641e+02 8.761e+02, threshold=6.000e+02, percent-clipped=1.0 2023-06-20 08:57:40,680 INFO [train.py:996] (0/4) Epoch 3, batch 26900, loss[loss=0.2171, simple_loss=0.2725, pruned_loss=0.08085, over 21353.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3141, pruned_loss=0.09201, over 4256090.82 frames. ], batch size: 160, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 08:58:31,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=527394.0, ans=0.1 2023-06-20 08:58:45,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=527454.0, ans=0.0 2023-06-20 08:59:47,452 INFO [train.py:996] (0/4) Epoch 3, batch 26950, loss[loss=0.2673, simple_loss=0.3537, pruned_loss=0.09051, over 21712.00 frames. 
], tot_loss[loss=0.2472, simple_loss=0.3128, pruned_loss=0.09079, over 4238290.90 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 09:00:13,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=527694.0, ans=0.125 2023-06-20 09:01:39,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.583e+02 3.256e+02 3.814e+02 7.772e+02, threshold=6.512e+02, percent-clipped=3.0 2023-06-20 09:01:52,868 INFO [train.py:996] (0/4) Epoch 3, batch 27000, loss[loss=0.2933, simple_loss=0.3594, pruned_loss=0.1136, over 21458.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3133, pruned_loss=0.08862, over 4250378.46 frames. ], batch size: 508, lr: 1.02e-02, grad_scale: 32.0 2023-06-20 09:01:52,870 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 09:02:49,143 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2585, simple_loss=0.355, pruned_loss=0.081, over 1796401.00 frames. 2023-06-20 09:02:49,144 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-20 09:03:04,347 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-88000.pt 2023-06-20 09:03:15,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=527994.0, ans=0.0 2023-06-20 09:04:29,068 INFO [train.py:996] (0/4) Epoch 3, batch 27050, loss[loss=0.2456, simple_loss=0.3451, pruned_loss=0.07306, over 20763.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3151, pruned_loss=0.08482, over 4255954.51 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 16.0 2023-06-20 09:06:30,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=528534.0, ans=0.1 2023-06-20 09:06:31,038 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.379e+02 2.798e+02 3.211e+02 4.498e+02, threshold=5.597e+02, percent-clipped=0.0 2023-06-20 09:06:31,062 INFO [train.py:996] (0/4) Epoch 3, batch 27100, loss[loss=0.2287, simple_loss=0.3165, pruned_loss=0.0705, over 20987.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3164, pruned_loss=0.08573, over 4267997.95 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 16.0 2023-06-20 09:07:26,888 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:08:32,561 INFO [train.py:996] (0/4) Epoch 3, batch 27150, loss[loss=0.2727, simple_loss=0.3597, pruned_loss=0.09286, over 21744.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3274, pruned_loss=0.08874, over 4268264.54 frames. 
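
The validation entries above ("Computing validation loss", "validation: loss=0.2585 ... over 1796401.00 frames", "Maximum memory allocated") follow the usual eval pattern: switch to eval mode, aggregate a frame-weighted loss over the dev loader, then report torch.cuda.max_memory_allocated. A generic sketch, with compute_loss standing in for the recipe's actual loss function (a hypothetical helper):

    import torch

    def validate(model, dev_loader, compute_loss, device="cuda:0"):
        model.eval()
        loss_sum, frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss, num_frames = compute_loss(model, batch)  # hypothetical helper
                loss_sum += loss.item() * num_frames
                frames += num_frames
        model.train()
        mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"validation: loss={loss_sum / frames:.4f}, over {frames:.2f} frames. "
              f"Maximum memory allocated so far is {mb}MB")
        return loss_sum / frames
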
], batch size: 298, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 09:09:14,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=528894.0, ans=0.2 2023-06-20 09:09:20,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=528894.0, ans=0.1 2023-06-20 09:09:56,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=529014.0, ans=0.125 2023-06-20 09:10:00,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=529014.0, ans=0.125 2023-06-20 09:10:33,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=529074.0, ans=0.125 2023-06-20 09:10:50,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.751e+02 3.147e+02 3.770e+02 6.870e+02, threshold=6.294e+02, percent-clipped=5.0 2023-06-20 09:10:50,543 INFO [train.py:996] (0/4) Epoch 3, batch 27200, loss[loss=0.3086, simple_loss=0.3902, pruned_loss=0.1135, over 21664.00 frames. ], tot_loss[loss=0.261, simple_loss=0.337, pruned_loss=0.09248, over 4277690.09 frames. ], batch size: 441, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:12:51,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=529374.0, ans=0.125 2023-06-20 09:12:57,433 INFO [train.py:996] (0/4) Epoch 3, batch 27250, loss[loss=0.3053, simple_loss=0.374, pruned_loss=0.1183, over 21856.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3397, pruned_loss=0.09697, over 4272612.54 frames. ], batch size: 118, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:12:57,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=529434.0, ans=0.0 2023-06-20 09:13:39,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=529554.0, ans=0.0 2023-06-20 09:13:54,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=22.5 2023-06-20 09:13:58,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=529614.0, ans=0.1 2023-06-20 09:14:59,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=529674.0, ans=0.95 2023-06-20 09:15:03,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 2.889e+02 3.267e+02 4.171e+02 5.993e+02, threshold=6.535e+02, percent-clipped=0.0 2023-06-20 09:15:03,723 INFO [train.py:996] (0/4) Epoch 3, batch 27300, loss[loss=0.2643, simple_loss=0.3472, pruned_loss=0.09071, over 21710.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3428, pruned_loss=0.09901, over 4272877.08 frames. 
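
The checkpoint-88000.pt saved above is named by the cumulative training batch index, consistent with saving every fixed number of batches. A minimal sketch of such a rule; the interval and the saved fields here are assumptions:

    import torch
    from pathlib import Path

    def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                              exp_dir: Path, save_every_n: int = 4000):
        # Save only on multiples of the interval (88000 = 22 * 4000).
        if batch_idx_train % save_every_n != 0:
            return
        path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            path,
        )
        print(f"Saving checkpoint to {path}")
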
], batch size: 351, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:15:04,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=529734.0, ans=0.125 2023-06-20 09:16:55,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=529974.0, ans=10.0 2023-06-20 09:17:26,159 INFO [train.py:996] (0/4) Epoch 3, batch 27350, loss[loss=0.2397, simple_loss=0.3311, pruned_loss=0.07419, over 21599.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3446, pruned_loss=0.1002, over 4275915.12 frames. ], batch size: 230, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:17:48,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=530094.0, ans=0.125 2023-06-20 09:17:49,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=530094.0, ans=0.125 2023-06-20 09:19:07,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=530214.0, ans=0.125 2023-06-20 09:19:27,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=530274.0, ans=0.0 2023-06-20 09:19:27,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.80 vs. limit=22.5 2023-06-20 09:19:31,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.578e+02 2.892e+02 3.255e+02 4.289e+02, threshold=5.785e+02, percent-clipped=0.0 2023-06-20 09:19:31,155 INFO [train.py:996] (0/4) Epoch 3, batch 27400, loss[loss=0.2269, simple_loss=0.2862, pruned_loss=0.08384, over 21619.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3401, pruned_loss=0.09954, over 4281080.39 frames. ], batch size: 263, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:19:33,478 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=12.0 2023-06-20 09:19:35,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=530334.0, ans=0.0 2023-06-20 09:20:18,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=530454.0, ans=0.0 2023-06-20 09:20:24,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-20 09:21:07,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=530514.0, ans=0.125 2023-06-20 09:21:19,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530574.0, ans=0.1 2023-06-20 09:21:21,135 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-20 09:21:32,888 INFO [train.py:996] (0/4) Epoch 3, batch 27450, loss[loss=0.2521, simple_loss=0.3385, pruned_loss=0.0828, over 21712.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3326, pruned_loss=0.09711, over 4280168.19 frames. 
], batch size: 332, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:21:33,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=530634.0, ans=0.2 2023-06-20 09:21:41,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=530634.0, ans=0.125 2023-06-20 09:22:25,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=530754.0, ans=0.125 2023-06-20 09:22:43,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0 2023-06-20 09:22:45,312 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.33 vs. limit=6.0 2023-06-20 09:23:33,231 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.511e+02 2.901e+02 3.384e+02 5.453e+02, threshold=5.802e+02, percent-clipped=0.0 2023-06-20 09:23:33,267 INFO [train.py:996] (0/4) Epoch 3, batch 27500, loss[loss=0.2944, simple_loss=0.3466, pruned_loss=0.1211, over 21768.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3312, pruned_loss=0.09718, over 4284890.14 frames. ], batch size: 441, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:23:56,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=530934.0, ans=0.125 2023-06-20 09:24:05,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=530994.0, ans=0.125 2023-06-20 09:24:06,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=530994.0, ans=0.125 2023-06-20 09:24:19,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=531054.0, ans=0.125 2023-06-20 09:24:32,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.34 vs. limit=12.0 2023-06-20 09:25:08,214 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-20 09:25:12,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-20 09:25:19,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=531174.0, ans=0.125 2023-06-20 09:25:24,436 INFO [train.py:996] (0/4) Epoch 3, batch 27550, loss[loss=0.2006, simple_loss=0.279, pruned_loss=0.06112, over 21371.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3243, pruned_loss=0.09341, over 4280845.31 frames. 
], batch size: 211, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:25:59,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=531294.0, ans=0.125 2023-06-20 09:26:17,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=531354.0, ans=0.125 2023-06-20 09:26:27,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=531354.0, ans=0.125 2023-06-20 09:27:16,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 2.453e+02 3.210e+02 4.120e+02 6.200e+02, threshold=6.421e+02, percent-clipped=4.0 2023-06-20 09:27:16,472 INFO [train.py:996] (0/4) Epoch 3, batch 27600, loss[loss=0.2486, simple_loss=0.3066, pruned_loss=0.09528, over 21573.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3177, pruned_loss=0.09233, over 4284102.62 frames. ], batch size: 391, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:27:18,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=531534.0, ans=0.125 2023-06-20 09:27:43,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-06-20 09:29:04,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=15.0 2023-06-20 09:29:08,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=531774.0, ans=0.125 2023-06-20 09:29:12,395 INFO [train.py:996] (0/4) Epoch 3, batch 27650, loss[loss=0.2384, simple_loss=0.312, pruned_loss=0.08235, over 21339.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3112, pruned_loss=0.09113, over 4276871.03 frames. ], batch size: 159, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:29:13,223 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-20 09:29:14,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-20 09:29:20,528 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-20 09:30:12,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-06-20 09:31:10,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.456e+02 3.066e+02 3.957e+02 5.583e+02, threshold=6.132e+02, percent-clipped=0.0 2023-06-20 09:31:10,956 INFO [train.py:996] (0/4) Epoch 3, batch 27700, loss[loss=0.2826, simple_loss=0.369, pruned_loss=0.09809, over 20883.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3118, pruned_loss=0.0889, over 4279424.67 frames. 
], batch size: 608, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:31:37,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=532134.0, ans=0.0 2023-06-20 09:31:41,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532194.0, ans=0.1 2023-06-20 09:31:42,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=532194.0, ans=0.1 2023-06-20 09:31:59,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=532194.0, ans=0.125 2023-06-20 09:32:42,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=532314.0, ans=0.0 2023-06-20 09:33:01,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=22.5 2023-06-20 09:33:02,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=532374.0, ans=0.125 2023-06-20 09:33:20,142 INFO [train.py:996] (0/4) Epoch 3, batch 27750, loss[loss=0.2276, simple_loss=0.3205, pruned_loss=0.06732, over 21281.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.316, pruned_loss=0.08859, over 4279238.46 frames. ], batch size: 548, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:33:47,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=532494.0, ans=0.125 2023-06-20 09:33:52,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=532494.0, ans=0.125 2023-06-20 09:33:56,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=532494.0, ans=0.125 2023-06-20 09:33:59,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=532494.0, ans=0.0 2023-06-20 09:34:00,114 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-20 09:34:16,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=532554.0, ans=0.125 2023-06-20 09:35:07,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=532674.0, ans=0.0 2023-06-20 09:35:16,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.578e+02 3.101e+02 3.636e+02 6.452e+02, threshold=6.201e+02, percent-clipped=3.0 2023-06-20 09:35:17,007 INFO [train.py:996] (0/4) Epoch 3, batch 27800, loss[loss=0.2454, simple_loss=0.3089, pruned_loss=0.09097, over 21877.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3149, pruned_loss=0.08908, over 4284241.12 frames. 
], batch size: 351, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:36:48,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=532914.0, ans=0.125 2023-06-20 09:37:23,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=532974.0, ans=0.07 2023-06-20 09:37:26,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=533034.0, ans=22.5 2023-06-20 09:37:27,378 INFO [train.py:996] (0/4) Epoch 3, batch 27850, loss[loss=0.3267, simple_loss=0.3946, pruned_loss=0.1294, over 21573.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3149, pruned_loss=0.0908, over 4292939.80 frames. ], batch size: 471, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:37:47,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-20 09:37:57,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-20 09:39:02,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=533214.0, ans=0.2 2023-06-20 09:39:31,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=533274.0, ans=0.2 2023-06-20 09:39:41,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.579e+02 2.899e+02 3.562e+02 7.537e+02, threshold=5.798e+02, percent-clipped=1.0 2023-06-20 09:39:41,770 INFO [train.py:996] (0/4) Epoch 3, batch 27900, loss[loss=0.2245, simple_loss=0.3086, pruned_loss=0.07022, over 21416.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3231, pruned_loss=0.09212, over 4288238.76 frames. ], batch size: 194, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:39:45,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2023-06-20 09:40:19,401 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:40:23,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=533394.0, ans=0.125 2023-06-20 09:41:17,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=22.5 2023-06-20 09:41:35,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=533574.0, ans=0.1 2023-06-20 09:41:47,950 INFO [train.py:996] (0/4) Epoch 3, batch 27950, loss[loss=0.2239, simple_loss=0.3172, pruned_loss=0.06528, over 21730.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3227, pruned_loss=0.08808, over 4279488.02 frames. 
], batch size: 247, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:41:49,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=533634.0, ans=0.125 2023-06-20 09:43:29,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=533874.0, ans=0.0 2023-06-20 09:43:29,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=533874.0, ans=0.125 2023-06-20 09:43:48,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=533874.0, ans=0.125 2023-06-20 09:43:54,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.301e+02 2.628e+02 3.233e+02 4.917e+02, threshold=5.255e+02, percent-clipped=0.0 2023-06-20 09:43:54,359 INFO [train.py:996] (0/4) Epoch 3, batch 28000, loss[loss=0.2572, simple_loss=0.3243, pruned_loss=0.09506, over 21863.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3204, pruned_loss=0.08613, over 4283313.01 frames. ], batch size: 351, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:44:13,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=533994.0, ans=0.125 2023-06-20 09:44:45,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.97 vs. limit=8.0 2023-06-20 09:46:01,515 INFO [train.py:996] (0/4) Epoch 3, batch 28050, loss[loss=0.2149, simple_loss=0.2611, pruned_loss=0.08433, over 21172.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3183, pruned_loss=0.08767, over 4284792.53 frames. ], batch size: 143, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:46:42,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=534294.0, ans=0.0 2023-06-20 09:46:42,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=534294.0, ans=0.07 2023-06-20 09:46:45,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=534354.0, ans=0.125 2023-06-20 09:47:49,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=534474.0, ans=0.0 2023-06-20 09:47:50,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=534474.0, ans=0.2 2023-06-20 09:48:07,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.725e+02 3.043e+02 3.696e+02 6.736e+02, threshold=6.086e+02, percent-clipped=4.0 2023-06-20 09:48:07,131 INFO [train.py:996] (0/4) Epoch 3, batch 28100, loss[loss=0.2146, simple_loss=0.2696, pruned_loss=0.07983, over 21180.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3169, pruned_loss=0.0887, over 4277747.54 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:48:27,400 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.52 vs. 
limit=22.5 2023-06-20 09:48:35,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=534594.0, ans=0.05 2023-06-20 09:48:41,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=534594.0, ans=0.125 2023-06-20 09:48:44,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=534594.0, ans=0.2 2023-06-20 09:48:47,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=534654.0, ans=0.125 2023-06-20 09:48:48,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=534654.0, ans=0.0 2023-06-20 09:49:14,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=534714.0, ans=0.1 2023-06-20 09:49:51,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=534834.0, ans=0.0 2023-06-20 09:49:51,904 INFO [train.py:996] (0/4) Epoch 3, batch 28150, loss[loss=0.2714, simple_loss=0.3006, pruned_loss=0.1212, over 21553.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3112, pruned_loss=0.08895, over 4282632.14 frames. ], batch size: 512, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:50:26,853 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=12.0 2023-06-20 09:50:59,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-20 09:51:52,815 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.680e+02 3.048e+02 3.566e+02 6.007e+02, threshold=6.096e+02, percent-clipped=0.0 2023-06-20 09:51:52,840 INFO [train.py:996] (0/4) Epoch 3, batch 28200, loss[loss=0.2681, simple_loss=0.3272, pruned_loss=0.1046, over 21713.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3098, pruned_loss=0.09077, over 4276340.43 frames. ], batch size: 282, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:52:09,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=535134.0, ans=0.5 2023-06-20 09:52:24,170 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-20 09:54:03,164 INFO [train.py:996] (0/4) Epoch 3, batch 28250, loss[loss=0.3083, simple_loss=0.3307, pruned_loss=0.1429, over 21288.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3158, pruned_loss=0.0947, over 4268924.41 frames. ], batch size: 507, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:54:23,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=535494.0, ans=0.2 2023-06-20 09:55:13,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.20 vs. 
limit=15.0 2023-06-20 09:55:52,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.642e+02 2.920e+02 3.400e+02 5.282e+02, threshold=5.841e+02, percent-clipped=0.0 2023-06-20 09:55:52,618 INFO [train.py:996] (0/4) Epoch 3, batch 28300, loss[loss=0.2307, simple_loss=0.3089, pruned_loss=0.0762, over 21191.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.313, pruned_loss=0.09197, over 4260426.41 frames. ], batch size: 548, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:55:59,090 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:56:03,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535734.0, ans=0.1 2023-06-20 09:57:55,385 INFO [train.py:996] (0/4) Epoch 3, batch 28350, loss[loss=0.2206, simple_loss=0.2879, pruned_loss=0.07668, over 21665.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3078, pruned_loss=0.08534, over 4266869.82 frames. ], batch size: 332, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 09:58:01,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=536034.0, ans=0.125 2023-06-20 09:59:00,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=536214.0, ans=0.125 2023-06-20 09:59:01,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-20 09:59:57,218 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.442e+02 2.906e+02 3.756e+02 6.474e+02, threshold=5.811e+02, percent-clipped=1.0 2023-06-20 09:59:57,242 INFO [train.py:996] (0/4) Epoch 3, batch 28400, loss[loss=0.2491, simple_loss=0.3126, pruned_loss=0.09282, over 21657.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3045, pruned_loss=0.08477, over 4264678.20 frames. ], batch size: 332, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:00:16,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=536394.0, ans=0.125 2023-06-20 10:00:52,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=536454.0, ans=0.0 2023-06-20 10:01:54,090 INFO [train.py:996] (0/4) Epoch 3, batch 28450, loss[loss=0.2687, simple_loss=0.3314, pruned_loss=0.1029, over 21886.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3107, pruned_loss=0.08957, over 4273998.35 frames. 
], batch size: 351, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:02:22,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=536634.0, ans=0.125 2023-06-20 10:03:29,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=536814.0, ans=0.0 2023-06-20 10:03:41,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=536814.0, ans=0.2 2023-06-20 10:04:20,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.862e+02 3.602e+02 4.332e+02 6.960e+02, threshold=7.204e+02, percent-clipped=5.0 2023-06-20 10:04:21,007 INFO [train.py:996] (0/4) Epoch 3, batch 28500, loss[loss=0.2793, simple_loss=0.3563, pruned_loss=0.1011, over 21477.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3128, pruned_loss=0.09172, over 4277781.38 frames. ], batch size: 131, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:04:48,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=536994.0, ans=0.125 2023-06-20 10:05:19,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=537054.0, ans=0.0 2023-06-20 10:05:30,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=537114.0, ans=0.125 2023-06-20 10:05:35,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.01 vs. limit=22.5 2023-06-20 10:06:02,417 INFO [train.py:996] (0/4) Epoch 3, batch 28550, loss[loss=0.2746, simple_loss=0.3596, pruned_loss=0.0948, over 21415.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3205, pruned_loss=0.09395, over 4280197.43 frames. ], batch size: 194, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:07:26,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=537414.0, ans=0.5 2023-06-20 10:07:56,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=537474.0, ans=0.125 2023-06-20 10:08:13,360 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.757e+02 3.378e+02 4.291e+02 7.271e+02, threshold=6.756e+02, percent-clipped=1.0 2023-06-20 10:08:13,384 INFO [train.py:996] (0/4) Epoch 3, batch 28600, loss[loss=0.2627, simple_loss=0.3403, pruned_loss=0.09252, over 21538.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3282, pruned_loss=0.09664, over 4279073.12 frames. ], batch size: 112, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:09:41,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=537714.0, ans=0.0 2023-06-20 10:09:50,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=537774.0, ans=0.0 2023-06-20 10:10:12,631 INFO [train.py:996] (0/4) Epoch 3, batch 28650, loss[loss=0.2418, simple_loss=0.2952, pruned_loss=0.09421, over 21887.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.323, pruned_loss=0.09582, over 4273554.08 frames. 
], batch size: 107, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 10:10:46,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-20 10:11:18,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=538014.0, ans=0.125 2023-06-20 10:11:30,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=538014.0, ans=0.125 2023-06-20 10:11:59,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=538074.0, ans=0.125 2023-06-20 10:12:15,804 INFO [train.py:996] (0/4) Epoch 3, batch 28700, loss[loss=0.2701, simple_loss=0.3348, pruned_loss=0.1027, over 21942.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3202, pruned_loss=0.09616, over 4271465.80 frames. ], batch size: 372, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 10:12:17,216 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.709e+02 3.341e+02 4.150e+02 7.060e+02, threshold=6.681e+02, percent-clipped=1.0 2023-06-20 10:12:34,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=538194.0, ans=0.125 2023-06-20 10:13:32,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=538314.0, ans=0.125 2023-06-20 10:14:19,921 INFO [train.py:996] (0/4) Epoch 3, batch 28750, loss[loss=0.2492, simple_loss=0.334, pruned_loss=0.08221, over 21848.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.321, pruned_loss=0.09638, over 4277011.70 frames. ], batch size: 371, lr: 1.01e-02, grad_scale: 16.0 2023-06-20 10:14:40,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=538494.0, ans=0.0 2023-06-20 10:14:44,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=538494.0, ans=0.125 2023-06-20 10:14:56,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=538554.0, ans=12.0 2023-06-20 10:15:11,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=538554.0, ans=15.0 2023-06-20 10:15:20,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=538554.0, ans=0.125 2023-06-20 10:15:41,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=538614.0, ans=0.0 2023-06-20 10:16:18,015 INFO [train.py:996] (0/4) Epoch 3, batch 28800, loss[loss=0.2881, simple_loss=0.3541, pruned_loss=0.1111, over 21857.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3244, pruned_loss=0.09681, over 4282521.74 frames. 
], batch size: 282, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:16:18,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=538734.0, ans=0.125 2023-06-20 10:16:18,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=538734.0, ans=0.125 2023-06-20 10:16:25,197 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.544e+02 3.077e+02 3.520e+02 7.771e+02, threshold=6.153e+02, percent-clipped=2.0 2023-06-20 10:16:57,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=538794.0, ans=0.0 2023-06-20 10:17:02,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=538794.0, ans=0.125 2023-06-20 10:17:51,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-20 10:17:57,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=538914.0, ans=0.0 2023-06-20 10:18:25,602 INFO [train.py:996] (0/4) Epoch 3, batch 28850, loss[loss=0.3065, simple_loss=0.3588, pruned_loss=0.1271, over 21565.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3296, pruned_loss=0.09895, over 4276058.17 frames. ], batch size: 471, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:18:41,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=539094.0, ans=0.5 2023-06-20 10:20:23,126 INFO [train.py:996] (0/4) Epoch 3, batch 28900, loss[loss=0.2694, simple_loss=0.3335, pruned_loss=0.1026, over 21914.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3349, pruned_loss=0.1019, over 4280152.18 frames. ], batch size: 316, lr: 1.01e-02, grad_scale: 32.0 2023-06-20 10:20:24,601 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.799e+02 3.210e+02 3.937e+02 8.118e+02, threshold=6.420e+02, percent-clipped=2.0 2023-06-20 10:20:26,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=539334.0, ans=0.0 2023-06-20 10:20:48,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=539394.0, ans=0.125 2023-06-20 10:20:56,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=539394.0, ans=0.125 2023-06-20 10:21:42,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=539514.0, ans=0.0 2023-06-20 10:22:02,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=539574.0, ans=0.05 2023-06-20 10:22:08,518 INFO [train.py:996] (0/4) Epoch 3, batch 28950, loss[loss=0.2275, simple_loss=0.3236, pruned_loss=0.06566, over 21820.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3324, pruned_loss=0.1008, over 4274056.58 frames. 
], batch size: 316, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:22:42,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=539694.0, ans=0.1 2023-06-20 10:22:43,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=22.5 2023-06-20 10:22:57,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-20 10:22:57,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-20 10:23:43,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=539814.0, ans=0.125 2023-06-20 10:24:30,734 INFO [train.py:996] (0/4) Epoch 3, batch 29000, loss[loss=0.2677, simple_loss=0.3587, pruned_loss=0.08834, over 20805.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3351, pruned_loss=0.09981, over 4270267.59 frames. ], batch size: 608, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:24:32,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.768e+02 3.396e+02 4.275e+02 6.208e+02, threshold=6.793e+02, percent-clipped=0.0 2023-06-20 10:24:34,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=539934.0, ans=6.0 2023-06-20 10:24:54,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=539934.0, ans=0.0 2023-06-20 10:24:55,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=539934.0, ans=0.0 2023-06-20 10:25:24,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=540054.0, ans=0.0 2023-06-20 10:25:44,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-20 10:26:03,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=540114.0, ans=0.125 2023-06-20 10:26:34,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=540174.0, ans=0.125 2023-06-20 10:26:38,119 INFO [train.py:996] (0/4) Epoch 3, batch 29050, loss[loss=0.3174, simple_loss=0.3487, pruned_loss=0.1431, over 21782.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3349, pruned_loss=0.1002, over 4275605.52 frames. ], batch size: 508, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:27:10,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=540294.0, ans=0.0 2023-06-20 10:27:25,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=540354.0, ans=0.125 2023-06-20 10:28:10,799 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.88 vs. limit=22.5 2023-06-20 10:28:27,053 INFO [train.py:996] (0/4) Epoch 3, batch 29100, loss[loss=0.2123, simple_loss=0.2789, pruned_loss=0.0729, over 21778.00 frames. 
], tot_loss[loss=0.2597, simple_loss=0.3258, pruned_loss=0.09679, over 4267814.46 frames. ], batch size: 351, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:28:34,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.763e+02 3.093e+02 3.779e+02 6.198e+02, threshold=6.186e+02, percent-clipped=0.0 2023-06-20 10:29:55,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=540774.0, ans=0.05 2023-06-20 10:30:09,101 INFO [train.py:996] (0/4) Epoch 3, batch 29150, loss[loss=0.264, simple_loss=0.3326, pruned_loss=0.09767, over 21229.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3231, pruned_loss=0.09474, over 4269279.42 frames. ], batch size: 548, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:30:17,117 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-20 10:31:07,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=541014.0, ans=0.125 2023-06-20 10:31:14,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=541014.0, ans=0.1 2023-06-20 10:31:51,689 INFO [train.py:996] (0/4) Epoch 3, batch 29200, loss[loss=0.2081, simple_loss=0.2661, pruned_loss=0.07501, over 21455.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3175, pruned_loss=0.09334, over 4266695.89 frames. ], batch size: 195, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:31:52,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=541134.0, ans=0.125 2023-06-20 10:31:53,130 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.609e+02 3.171e+02 4.055e+02 6.216e+02, threshold=6.341e+02, percent-clipped=1.0 2023-06-20 10:32:44,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=541254.0, ans=0.0 2023-06-20 10:33:03,782 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-20 10:34:03,934 INFO [train.py:996] (0/4) Epoch 3, batch 29250, loss[loss=0.2306, simple_loss=0.3109, pruned_loss=0.07518, over 21427.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3137, pruned_loss=0.08977, over 4260749.81 frames. ], batch size: 195, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:34:33,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-20 10:35:18,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=541614.0, ans=0.1 2023-06-20 10:35:24,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=541614.0, ans=0.0 2023-06-20 10:35:26,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-20 10:35:47,774 INFO [train.py:996] (0/4) Epoch 3, batch 29300, loss[loss=0.2523, simple_loss=0.331, pruned_loss=0.08678, over 21697.00 frames. 
], tot_loss[loss=0.2486, simple_loss=0.3174, pruned_loss=0.08989, over 4263252.34 frames. ], batch size: 298, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:36:05,800 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.416e+02 2.645e+02 3.207e+02 5.648e+02, threshold=5.289e+02, percent-clipped=0.0 2023-06-20 10:36:36,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=541794.0, ans=0.0 2023-06-20 10:36:39,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=541794.0, ans=0.125 2023-06-20 10:36:40,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=541794.0, ans=0.0 2023-06-20 10:36:54,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=541854.0, ans=0.125 2023-06-20 10:37:00,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-20 10:37:50,506 INFO [train.py:996] (0/4) Epoch 3, batch 29350, loss[loss=0.2748, simple_loss=0.3504, pruned_loss=0.09961, over 21610.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3139, pruned_loss=0.08959, over 4263468.49 frames. ], batch size: 442, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:38:11,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=542034.0, ans=0.0 2023-06-20 10:38:31,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=542094.0, ans=0.125 2023-06-20 10:38:39,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=542154.0, ans=0.0 2023-06-20 10:39:12,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=542214.0, ans=0.125 2023-06-20 10:39:12,854 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.68 vs. limit=6.0 2023-06-20 10:39:19,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=542214.0, ans=0.125 2023-06-20 10:39:20,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=542214.0, ans=0.125 2023-06-20 10:39:22,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=542214.0, ans=0.125 2023-06-20 10:39:30,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=15.0 2023-06-20 10:39:59,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542274.0, ans=0.1 2023-06-20 10:40:03,541 INFO [train.py:996] (0/4) Epoch 3, batch 29400, loss[loss=0.2288, simple_loss=0.3016, pruned_loss=0.07795, over 20013.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3138, pruned_loss=0.08741, over 4265935.62 frames. 
], batch size: 703, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:40:04,944 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.619e+02 2.949e+02 3.526e+02 5.601e+02, threshold=5.897e+02, percent-clipped=1.0 2023-06-20 10:40:33,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=542394.0, ans=0.2 2023-06-20 10:40:51,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=22.5 2023-06-20 10:41:13,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=542514.0, ans=0.125 2023-06-20 10:41:48,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=542574.0, ans=0.2 2023-06-20 10:42:05,237 INFO [train.py:996] (0/4) Epoch 3, batch 29450, loss[loss=0.235, simple_loss=0.3184, pruned_loss=0.07578, over 19944.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3118, pruned_loss=0.08626, over 4263997.08 frames. ], batch size: 703, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:42:16,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.91 vs. limit=22.5 2023-06-20 10:42:23,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542634.0, ans=0.1 2023-06-20 10:43:21,704 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.09 vs. limit=22.5 2023-06-20 10:43:55,593 INFO [train.py:996] (0/4) Epoch 3, batch 29500, loss[loss=0.2386, simple_loss=0.3043, pruned_loss=0.08646, over 21939.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3159, pruned_loss=0.09018, over 4267836.30 frames. ], batch size: 333, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:43:57,014 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.681e+02 3.088e+02 3.658e+02 6.266e+02, threshold=6.176e+02, percent-clipped=1.0 2023-06-20 10:44:07,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=542934.0, ans=0.035 2023-06-20 10:44:07,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=542934.0, ans=0.2 2023-06-20 10:44:11,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542934.0, ans=0.1 2023-06-20 10:44:52,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=542994.0, ans=0.125 2023-06-20 10:45:10,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=543114.0, ans=0.125 2023-06-20 10:45:27,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=543114.0, ans=0.2 2023-06-20 10:46:06,823 INFO [train.py:996] (0/4) Epoch 3, batch 29550, loss[loss=0.2449, simple_loss=0.3059, pruned_loss=0.09196, over 21859.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3166, pruned_loss=0.09195, over 4273010.97 frames. 
], batch size: 298, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:47:58,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=543474.0, ans=0.125 2023-06-20 10:48:04,722 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:48:26,408 INFO [train.py:996] (0/4) Epoch 3, batch 29600, loss[loss=0.2642, simple_loss=0.3429, pruned_loss=0.09281, over 21474.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.323, pruned_loss=0.09432, over 4273981.41 frames. ], batch size: 194, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:48:27,807 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 3.045e+02 3.811e+02 4.571e+02 9.006e+02, threshold=7.623e+02, percent-clipped=4.0 2023-06-20 10:48:51,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.23 vs. limit=12.0 2023-06-20 10:48:53,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=543594.0, ans=0.1 2023-06-20 10:49:08,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=543594.0, ans=0.125 2023-06-20 10:49:26,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=543654.0, ans=0.2 2023-06-20 10:49:26,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=543654.0, ans=0.125 2023-06-20 10:49:27,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=543654.0, ans=0.125 2023-06-20 10:49:32,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=543654.0, ans=0.0 2023-06-20 10:50:03,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-20 10:50:34,292 INFO [train.py:996] (0/4) Epoch 3, batch 29650, loss[loss=0.1919, simple_loss=0.2595, pruned_loss=0.06218, over 21586.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3192, pruned_loss=0.09037, over 4273779.65 frames. ], batch size: 230, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:50:36,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=543834.0, ans=0.125 2023-06-20 10:51:17,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=543894.0, ans=0.1 2023-06-20 10:51:19,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0 2023-06-20 10:51:34,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-20 10:51:50,451 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:52:02,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. 
limit=15.0 2023-06-20 10:52:12,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544074.0, ans=0.1 2023-06-20 10:52:17,993 INFO [train.py:996] (0/4) Epoch 3, batch 29700, loss[loss=0.2706, simple_loss=0.3508, pruned_loss=0.0952, over 21184.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3215, pruned_loss=0.09061, over 4280173.71 frames. ], batch size: 143, lr: 1.00e-02, grad_scale: 32.0 2023-06-20 10:52:19,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.285e+02 2.545e+02 2.906e+02 5.391e+02, threshold=5.090e+02, percent-clipped=0.0 2023-06-20 10:53:53,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=544374.0, ans=0.0 2023-06-20 10:54:13,213 INFO [train.py:996] (0/4) Epoch 3, batch 29750, loss[loss=0.2355, simple_loss=0.3183, pruned_loss=0.07632, over 21327.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3256, pruned_loss=0.09044, over 4275704.74 frames. ], batch size: 159, lr: 1.00e-02, grad_scale: 16.0 2023-06-20 10:54:15,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=544434.0, ans=0.2 2023-06-20 10:55:37,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-20 10:55:48,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=22.5 2023-06-20 10:55:57,418 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-20 10:56:14,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=544674.0, ans=0.0 2023-06-20 10:56:17,043 INFO [train.py:996] (0/4) Epoch 3, batch 29800, loss[loss=0.2496, simple_loss=0.3149, pruned_loss=0.0922, over 21453.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3284, pruned_loss=0.09261, over 4284364.01 frames. 
], batch size: 194, lr: 1.00e-02, grad_scale: 16.0 2023-06-20 10:56:28,610 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.744e+02 3.274e+02 3.878e+02 6.407e+02, threshold=6.548e+02, percent-clipped=7.0 2023-06-20 10:56:30,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=544734.0, ans=0.0 2023-06-20 10:56:55,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=544794.0, ans=0.125 2023-06-20 10:56:59,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=544794.0, ans=0.125 2023-06-20 10:57:15,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=544794.0, ans=0.025 2023-06-20 10:57:43,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=544914.0, ans=0.125 2023-06-20 10:57:56,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=544974.0, ans=0.125 2023-06-20 10:58:16,254 INFO [train.py:996] (0/4) Epoch 3, batch 29850, loss[loss=0.2186, simple_loss=0.2867, pruned_loss=0.07525, over 21752.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3248, pruned_loss=0.09014, over 4287413.80 frames. ], batch size: 231, lr: 1.00e-02, grad_scale: 16.0 2023-06-20 10:58:43,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=545034.0, ans=0.125 2023-06-20 10:58:43,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=545034.0, ans=0.2 2023-06-20 10:59:05,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=545154.0, ans=0.035 2023-06-20 10:59:33,452 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.52 vs. limit=6.0 2023-06-20 10:59:49,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2023-06-20 11:00:18,350 INFO [train.py:996] (0/4) Epoch 3, batch 29900, loss[loss=0.1984, simple_loss=0.2752, pruned_loss=0.06079, over 20826.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3218, pruned_loss=0.091, over 4290575.34 frames. ], batch size: 608, lr: 1.00e-02, grad_scale: 16.0 2023-06-20 11:00:21,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.554e+02 2.921e+02 3.179e+02 4.891e+02, threshold=5.842e+02, percent-clipped=0.0 2023-06-20 11:01:08,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=545454.0, ans=0.125 2023-06-20 11:01:15,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.32 vs. 
limit=15.0 2023-06-20 11:01:28,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=545514.0, ans=0.125 2023-06-20 11:02:07,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=545574.0, ans=0.0 2023-06-20 11:02:27,193 INFO [train.py:996] (0/4) Epoch 3, batch 29950, loss[loss=0.2851, simple_loss=0.3504, pruned_loss=0.1099, over 21519.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.327, pruned_loss=0.09656, over 4293105.44 frames. ], batch size: 194, lr: 9.99e-03, grad_scale: 16.0 2023-06-20 11:03:38,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=545754.0, ans=0.125 2023-06-20 11:03:57,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.28 vs. limit=15.0 2023-06-20 11:04:16,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545874.0, ans=0.1 2023-06-20 11:04:41,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=545874.0, ans=0.04949747468305833 2023-06-20 11:04:42,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=545934.0, ans=0.0 2023-06-20 11:04:43,779 INFO [train.py:996] (0/4) Epoch 3, batch 30000, loss[loss=0.2226, simple_loss=0.314, pruned_loss=0.06559, over 21796.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3298, pruned_loss=0.09671, over 4290161.63 frames. ], batch size: 282, lr: 9.99e-03, grad_scale: 32.0 2023-06-20 11:04:43,780 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 11:05:44,593 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2515, simple_loss=0.3537, pruned_loss=0.07464, over 1796401.00 frames. 2023-06-20 11:05:44,596 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-20 11:05:47,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.773e+02 3.132e+02 3.473e+02 5.556e+02, threshold=6.264e+02, percent-clipped=0.0 2023-06-20 11:06:12,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=545994.0, ans=0.0 2023-06-20 11:06:31,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=546054.0, ans=0.2 2023-06-20 11:06:33,578 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:07:04,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=546114.0, ans=0.1 2023-06-20 11:07:51,366 INFO [train.py:996] (0/4) Epoch 3, batch 30050, loss[loss=0.211, simple_loss=0.3373, pruned_loss=0.04233, over 19825.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.334, pruned_loss=0.09412, over 4285442.91 frames. 
], batch size: 702, lr: 9.99e-03, grad_scale: 32.0 2023-06-20 11:07:53,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=546234.0, ans=0.125 2023-06-20 11:07:56,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=546234.0, ans=0.125 2023-06-20 11:09:38,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-06-20 11:09:38,688 INFO [train.py:996] (0/4) Epoch 3, batch 30100, loss[loss=0.2503, simple_loss=0.3052, pruned_loss=0.09769, over 21867.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3302, pruned_loss=0.09298, over 4275919.44 frames. ], batch size: 98, lr: 9.99e-03, grad_scale: 32.0 2023-06-20 11:09:41,730 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.525e+02 3.109e+02 3.772e+02 7.845e+02, threshold=6.218e+02, percent-clipped=1.0 2023-06-20 11:10:01,430 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-20 11:10:08,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=546594.0, ans=0.125 2023-06-20 11:10:22,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=546654.0, ans=0.0 2023-06-20 11:11:23,389 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:11:25,729 INFO [train.py:996] (0/4) Epoch 3, batch 30150, loss[loss=0.2544, simple_loss=0.2982, pruned_loss=0.1053, over 20136.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3256, pruned_loss=0.09438, over 4257684.72 frames. ], batch size: 702, lr: 9.98e-03, grad_scale: 32.0 2023-06-20 11:12:53,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=547014.0, ans=0.125 2023-06-20 11:13:28,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=547074.0, ans=0.125 2023-06-20 11:13:36,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=1.97 vs. limit=12.0 2023-06-20 11:13:42,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=547134.0, ans=0.0 2023-06-20 11:13:43,231 INFO [train.py:996] (0/4) Epoch 3, batch 30200, loss[loss=0.2491, simple_loss=0.327, pruned_loss=0.08558, over 21729.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3275, pruned_loss=0.0931, over 4261055.29 frames. 
], batch size: 124, lr: 9.98e-03, grad_scale: 32.0 2023-06-20 11:13:46,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.482e+02 2.831e+02 3.246e+02 4.619e+02, threshold=5.661e+02, percent-clipped=0.0 2023-06-20 11:14:32,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=547254.0, ans=0.125 2023-06-20 11:15:18,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=547314.0, ans=0.5 2023-06-20 11:15:58,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=547374.0, ans=0.1 2023-06-20 11:16:00,891 INFO [train.py:996] (0/4) Epoch 3, batch 30250, loss[loss=0.3595, simple_loss=0.4319, pruned_loss=0.1435, over 21532.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3364, pruned_loss=0.09527, over 4264951.39 frames. ], batch size: 471, lr: 9.98e-03, grad_scale: 16.0 2023-06-20 11:18:02,148 INFO [train.py:996] (0/4) Epoch 3, batch 30300, loss[loss=0.2207, simple_loss=0.2823, pruned_loss=0.07957, over 21746.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3342, pruned_loss=0.09501, over 4265961.57 frames. ], batch size: 317, lr: 9.97e-03, grad_scale: 16.0 2023-06-20 11:18:06,673 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.736e+02 3.183e+02 3.962e+02 5.943e+02, threshold=6.366e+02, percent-clipped=1.0 2023-06-20 11:18:21,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=547794.0, ans=0.0 2023-06-20 11:18:23,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=547794.0, ans=0.04949747468305833 2023-06-20 11:19:21,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=547914.0, ans=0.125 2023-06-20 11:20:01,347 INFO [train.py:996] (0/4) Epoch 3, batch 30350, loss[loss=0.3486, simple_loss=0.4105, pruned_loss=0.1433, over 21569.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3312, pruned_loss=0.09553, over 4263310.70 frames. ], batch size: 473, lr: 9.97e-03, grad_scale: 16.0 2023-06-20 11:20:02,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-20 11:20:40,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=548094.0, ans=0.125 2023-06-20 11:21:42,583 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:21:46,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=548214.0, ans=0.0 2023-06-20 11:22:41,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.34 vs. limit=22.5 2023-06-20 11:22:57,197 INFO [train.py:996] (0/4) Epoch 3, batch 30400, loss[loss=0.2544, simple_loss=0.2968, pruned_loss=0.106, over 20274.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3227, pruned_loss=0.09297, over 4250176.00 frames. 
], batch size: 703, lr: 9.97e-03, grad_scale: 32.0 2023-06-20 11:23:00,866 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.083e+02 3.599e+02 4.389e+02 8.139e+02, threshold=7.198e+02, percent-clipped=3.0 2023-06-20 11:24:58,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-20 11:26:34,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=548574.0, ans=0.0 2023-06-20 11:26:36,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-20 11:26:53,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=548634.0, ans=0.125 2023-06-20 11:26:54,478 INFO [train.py:996] (0/4) Epoch 3, batch 30450, loss[loss=0.2841, simple_loss=0.3714, pruned_loss=0.09838, over 19939.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3244, pruned_loss=0.09383, over 4193503.29 frames. ], batch size: 702, lr: 9.97e-03, grad_scale: 32.0 2023-06-20 11:27:36,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=548634.0, ans=0.125 2023-06-20 11:27:37,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=548634.0, ans=0.1 2023-06-20 11:28:35,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=548754.0, ans=0.1 2023-06-20 11:28:56,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=548754.0, ans=0.1 2023-06-20 11:29:53,233 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-3.pt 2023-06-20 11:32:21,683 INFO [train.py:996] (0/4) Epoch 4, batch 0, loss[loss=0.2762, simple_loss=0.3288, pruned_loss=0.1118, over 21537.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3288, pruned_loss=0.1118, over 21537.00 frames. ], batch size: 391, lr: 8.60e-03, grad_scale: 32.0 2023-06-20 11:32:21,685 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 11:33:10,763 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2494, simple_loss=0.3589, pruned_loss=0.06994, over 1796401.00 frames. 2023-06-20 11:33:10,764 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-20 11:33:23,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.619e+02 4.575e+02 6.276e+02 9.904e+02 2.096e+03, threshold=1.255e+03, percent-clipped=39.0 2023-06-20 11:34:02,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=549024.0, ans=0.2 2023-06-20 11:34:31,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.61 vs. 
limit=10.0 2023-06-20 11:34:51,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=549144.0, ans=0.125 2023-06-20 11:34:53,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=549204.0, ans=22.5 2023-06-20 11:34:53,774 INFO [train.py:996] (0/4) Epoch 4, batch 50, loss[loss=0.2274, simple_loss=0.3136, pruned_loss=0.07059, over 21597.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.328, pruned_loss=0.09062, over 963698.75 frames. ], batch size: 230, lr: 8.60e-03, grad_scale: 32.0 2023-06-20 11:35:19,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=549204.0, ans=0.125 2023-06-20 11:35:26,908 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=22.5 2023-06-20 11:35:27,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=549264.0, ans=0.09899494936611666 2023-06-20 11:36:16,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=549384.0, ans=0.0 2023-06-20 11:36:45,526 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:36:59,324 INFO [train.py:996] (0/4) Epoch 4, batch 100, loss[loss=0.2682, simple_loss=0.3738, pruned_loss=0.08135, over 21455.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3487, pruned_loss=0.09434, over 1692405.05 frames. ], batch size: 211, lr: 8.60e-03, grad_scale: 32.0 2023-06-20 11:37:15,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=549504.0, ans=0.125 2023-06-20 11:37:24,736 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.415e+02 2.758e+02 3.125e+02 7.692e+02, threshold=5.515e+02, percent-clipped=0.0 2023-06-20 11:37:27,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=549564.0, ans=0.125 2023-06-20 11:37:58,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=549624.0, ans=0.0 2023-06-20 11:38:24,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=549684.0, ans=0.0 2023-06-20 11:38:47,461 INFO [train.py:996] (0/4) Epoch 4, batch 150, loss[loss=0.2569, simple_loss=0.3568, pruned_loss=0.07849, over 21820.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3464, pruned_loss=0.09248, over 2266771.51 frames. ], batch size: 316, lr: 8.59e-03, grad_scale: 32.0 2023-06-20 11:39:05,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=549864.0, ans=0.1 2023-06-20 11:40:11,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=549984.0, ans=0.0 2023-06-20 11:40:46,820 INFO [train.py:996] (0/4) Epoch 4, batch 200, loss[loss=0.2757, simple_loss=0.3447, pruned_loss=0.1034, over 21482.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3419, pruned_loss=0.09138, over 2716223.45 frames. 
], batch size: 131, lr: 8.59e-03, grad_scale: 32.0 2023-06-20 11:40:54,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=550104.0, ans=0.125 2023-06-20 11:41:04,824 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.490e+02 2.754e+02 3.308e+02 4.592e+02, threshold=5.508e+02, percent-clipped=0.0 2023-06-20 11:41:13,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=550164.0, ans=0.125 2023-06-20 11:41:16,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=550164.0, ans=0.125 2023-06-20 11:41:34,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=550224.0, ans=0.125 2023-06-20 11:41:36,695 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. limit=10.0 2023-06-20 11:42:17,006 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=22.5 2023-06-20 11:42:45,767 INFO [train.py:996] (0/4) Epoch 4, batch 250, loss[loss=0.2359, simple_loss=0.2989, pruned_loss=0.08643, over 21736.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3383, pruned_loss=0.09216, over 3059332.35 frames. ], batch size: 124, lr: 8.59e-03, grad_scale: 32.0 2023-06-20 11:43:50,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-20 11:43:51,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=550524.0, ans=0.125 2023-06-20 11:44:39,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=550584.0, ans=0.125 2023-06-20 11:45:28,959 INFO [train.py:996] (0/4) Epoch 4, batch 300, loss[loss=0.2429, simple_loss=0.33, pruned_loss=0.07794, over 21696.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3349, pruned_loss=0.0932, over 3323191.11 frames. ], batch size: 298, lr: 8.59e-03, grad_scale: 32.0 2023-06-20 11:45:43,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=550704.0, ans=0.125 2023-06-20 11:45:47,474 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.545e+02 3.030e+02 3.596e+02 5.664e+02, threshold=6.060e+02, percent-clipped=1.0 2023-06-20 11:46:26,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=550824.0, ans=0.05 2023-06-20 11:47:39,857 INFO [train.py:996] (0/4) Epoch 4, batch 350, loss[loss=0.2614, simple_loss=0.3571, pruned_loss=0.08285, over 21649.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3292, pruned_loss=0.09272, over 3539563.65 frames. 
], batch size: 389, lr: 8.59e-03, grad_scale: 32.0 2023-06-20 11:48:55,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=551124.0, ans=0.2 2023-06-20 11:50:05,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=551244.0, ans=0.125 2023-06-20 11:50:12,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.81 vs. limit=10.0 2023-06-20 11:50:13,572 INFO [train.py:996] (0/4) Epoch 4, batch 400, loss[loss=0.225, simple_loss=0.2783, pruned_loss=0.0858, over 21245.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.321, pruned_loss=0.08973, over 3699748.84 frames. ], batch size: 160, lr: 8.58e-03, grad_scale: 32.0 2023-06-20 11:50:26,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=551304.0, ans=0.0 2023-06-20 11:50:36,579 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.584e+02 2.891e+02 3.548e+02 6.771e+02, threshold=5.782e+02, percent-clipped=1.0 2023-06-20 11:51:44,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=551484.0, ans=0.025 2023-06-20 11:51:46,895 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:51:49,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=551544.0, ans=0.125 2023-06-20 11:52:30,852 INFO [train.py:996] (0/4) Epoch 4, batch 450, loss[loss=0.2182, simple_loss=0.2603, pruned_loss=0.08801, over 20261.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3172, pruned_loss=0.08799, over 3829307.73 frames. ], batch size: 702, lr: 8.58e-03, grad_scale: 32.0 2023-06-20 11:54:07,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=551844.0, ans=0.0 2023-06-20 11:54:32,207 INFO [train.py:996] (0/4) Epoch 4, batch 500, loss[loss=0.2583, simple_loss=0.3577, pruned_loss=0.07944, over 21634.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3186, pruned_loss=0.08693, over 3933852.07 frames. ], batch size: 441, lr: 8.58e-03, grad_scale: 32.0 2023-06-20 11:54:57,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.704e+02 3.098e+02 4.611e+02 6.929e+02, threshold=6.196e+02, percent-clipped=8.0 2023-06-20 11:55:11,597 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-92000.pt 2023-06-20 11:55:22,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=552024.0, ans=0.2 2023-06-20 11:55:57,235 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:56:23,596 INFO [train.py:996] (0/4) Epoch 4, batch 550, loss[loss=0.2836, simple_loss=0.3817, pruned_loss=0.09276, over 21758.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3239, pruned_loss=0.08809, over 4020258.53 frames. 
], batch size: 351, lr: 8.58e-03, grad_scale: 32.0 2023-06-20 11:57:04,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=552264.0, ans=0.0 2023-06-20 11:57:28,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=552324.0, ans=0.0 2023-06-20 11:57:38,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=15.0 2023-06-20 11:57:46,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=552384.0, ans=0.125 2023-06-20 11:58:23,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552444.0, ans=0.1 2023-06-20 11:58:44,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-20 11:58:45,244 INFO [train.py:996] (0/4) Epoch 4, batch 600, loss[loss=0.2586, simple_loss=0.337, pruned_loss=0.09006, over 21683.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3277, pruned_loss=0.0881, over 4076951.68 frames. ], batch size: 247, lr: 8.57e-03, grad_scale: 32.0 2023-06-20 11:58:58,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.831e+02 3.316e+02 4.076e+02 6.310e+02, threshold=6.631e+02, percent-clipped=1.0 2023-06-20 11:59:33,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=552624.0, ans=0.1 2023-06-20 12:00:18,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=552744.0, ans=0.2 2023-06-20 12:00:22,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=552804.0, ans=0.125 2023-06-20 12:00:29,235 INFO [train.py:996] (0/4) Epoch 4, batch 650, loss[loss=0.2525, simple_loss=0.3181, pruned_loss=0.09342, over 21814.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3277, pruned_loss=0.08901, over 4124587.45 frames. ], batch size: 102, lr: 8.57e-03, grad_scale: 16.0 2023-06-20 12:00:39,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=552804.0, ans=0.0 2023-06-20 12:00:54,479 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-20 12:01:04,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=552864.0, ans=0.2 2023-06-20 12:01:29,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=552924.0, ans=0.0 2023-06-20 12:02:45,253 INFO [train.py:996] (0/4) Epoch 4, batch 700, loss[loss=0.2102, simple_loss=0.2699, pruned_loss=0.07532, over 21634.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3265, pruned_loss=0.08946, over 4150839.63 frames. ], batch size: 247, lr: 8.57e-03, grad_scale: 16.0 2023-06-20 12:02:55,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=553104.0, ans=0.125 2023-06-20 12:02:57,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. 
limit=22.5 2023-06-20 12:02:59,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.784e+02 3.467e+02 4.888e+02 7.822e+02, threshold=6.935e+02, percent-clipped=3.0 2023-06-20 12:03:32,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-20 12:03:55,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=553344.0, ans=0.1 2023-06-20 12:04:16,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=553344.0, ans=0.125 2023-06-20 12:04:28,795 INFO [train.py:996] (0/4) Epoch 4, batch 750, loss[loss=0.216, simple_loss=0.285, pruned_loss=0.07349, over 21752.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3255, pruned_loss=0.0893, over 4180324.99 frames. ], batch size: 298, lr: 8.57e-03, grad_scale: 16.0 2023-06-20 12:04:50,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=553464.0, ans=0.05 2023-06-20 12:05:01,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=553464.0, ans=0.0 2023-06-20 12:05:29,074 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:05:42,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. limit=10.0 2023-06-20 12:05:49,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=553644.0, ans=0.04949747468305833 2023-06-20 12:06:08,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=553644.0, ans=0.0 2023-06-20 12:06:12,195 INFO [train.py:996] (0/4) Epoch 4, batch 800, loss[loss=0.2611, simple_loss=0.3625, pruned_loss=0.07982, over 21341.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3228, pruned_loss=0.08953, over 4211980.49 frames. ], batch size: 548, lr: 8.56e-03, grad_scale: 32.0 2023-06-20 12:06:43,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.671e+02 3.153e+02 3.774e+02 5.879e+02, threshold=6.307e+02, percent-clipped=0.0 2023-06-20 12:07:01,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=553764.0, ans=0.125 2023-06-20 12:07:09,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=553764.0, ans=0.2 2023-06-20 12:07:16,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=553824.0, ans=0.125 2023-06-20 12:07:31,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=553824.0, ans=0.125 2023-06-20 12:07:40,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=553884.0, ans=0.5 2023-06-20 12:08:29,752 INFO [train.py:996] (0/4) Epoch 4, batch 850, loss[loss=0.2617, simple_loss=0.3242, pruned_loss=0.09963, over 21881.00 frames. 
], tot_loss[loss=0.2496, simple_loss=0.3204, pruned_loss=0.08938, over 4229639.30 frames. ], batch size: 351, lr: 8.56e-03, grad_scale: 32.0 2023-06-20 12:08:41,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=12.0 2023-06-20 12:09:41,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=554244.0, ans=0.0 2023-06-20 12:10:06,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-20 12:10:07,120 INFO [train.py:996] (0/4) Epoch 4, batch 900, loss[loss=0.2183, simple_loss=0.3095, pruned_loss=0.06353, over 21790.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3192, pruned_loss=0.08901, over 4244932.64 frames. ], batch size: 332, lr: 8.56e-03, grad_scale: 32.0 2023-06-20 12:10:09,163 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:10:10,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=554304.0, ans=0.125 2023-06-20 12:10:36,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=554304.0, ans=0.125 2023-06-20 12:10:42,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.695e+02 3.057e+02 3.508e+02 5.893e+02, threshold=6.115e+02, percent-clipped=0.0 2023-06-20 12:11:02,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-20 12:11:04,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-20 12:11:09,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=554424.0, ans=0.125 2023-06-20 12:11:25,233 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-20 12:11:45,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=554544.0, ans=0.2 2023-06-20 12:12:17,117 INFO [train.py:996] (0/4) Epoch 4, batch 950, loss[loss=0.2638, simple_loss=0.3331, pruned_loss=0.09725, over 21796.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3161, pruned_loss=0.08702, over 4260490.82 frames. ], batch size: 414, lr: 8.56e-03, grad_scale: 32.0 2023-06-20 12:12:17,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=554604.0, ans=0.0 2023-06-20 12:13:03,045 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:13:53,072 INFO [train.py:996] (0/4) Epoch 4, batch 1000, loss[loss=0.2567, simple_loss=0.3363, pruned_loss=0.08855, over 21851.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3155, pruned_loss=0.08714, over 4272922.20 frames. 
], batch size: 316, lr: 8.56e-03, grad_scale: 32.0 2023-06-20 12:14:14,183 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.396e+02 2.779e+02 3.341e+02 4.374e+02, threshold=5.558e+02, percent-clipped=0.0 2023-06-20 12:14:14,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=554964.0, ans=0.2 2023-06-20 12:14:33,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=554964.0, ans=0.125 2023-06-20 12:14:53,530 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:15:37,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=12.0 2023-06-20 12:15:57,393 INFO [train.py:996] (0/4) Epoch 4, batch 1050, loss[loss=0.2411, simple_loss=0.3252, pruned_loss=0.07849, over 21835.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.315, pruned_loss=0.087, over 4282997.61 frames. ], batch size: 316, lr: 8.55e-03, grad_scale: 32.0 2023-06-20 12:16:13,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=555204.0, ans=0.1 2023-06-20 12:16:32,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=555264.0, ans=0.0 2023-06-20 12:17:26,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=555384.0, ans=0.0 2023-06-20 12:17:34,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=15.0 2023-06-20 12:18:12,958 INFO [train.py:996] (0/4) Epoch 4, batch 1100, loss[loss=0.2409, simple_loss=0.3255, pruned_loss=0.07811, over 21786.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3149, pruned_loss=0.086, over 4279758.40 frames. ], batch size: 371, lr: 8.55e-03, grad_scale: 16.0 2023-06-20 12:18:27,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-20 12:18:29,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.724e+02 3.414e+02 3.990e+02 8.036e+02, threshold=6.829e+02, percent-clipped=6.0 2023-06-20 12:18:32,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=555564.0, ans=0.025 2023-06-20 12:18:48,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=555624.0, ans=0.125 2023-06-20 12:18:54,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=555624.0, ans=0.0 2023-06-20 12:19:20,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=555684.0, ans=0.0 2023-06-20 12:19:32,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-20 12:19:51,994 INFO [train.py:996] (0/4) Epoch 4, batch 1150, loss[loss=0.176, simple_loss=0.2301, pruned_loss=0.06094, over 16693.00 frames. 
], tot_loss[loss=0.2443, simple_loss=0.3168, pruned_loss=0.0859, over 4284341.55 frames. ], batch size: 60, lr: 8.55e-03, grad_scale: 16.0 2023-06-20 12:20:22,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=555804.0, ans=0.125 2023-06-20 12:20:46,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-20 12:21:06,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=555924.0, ans=0.0 2023-06-20 12:21:14,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=555984.0, ans=0.125 2023-06-20 12:21:56,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=556044.0, ans=0.0 2023-06-20 12:22:03,441 INFO [train.py:996] (0/4) Epoch 4, batch 1200, loss[loss=0.2839, simple_loss=0.3586, pruned_loss=0.1047, over 21501.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3175, pruned_loss=0.08729, over 4284163.15 frames. ], batch size: 131, lr: 8.55e-03, grad_scale: 32.0 2023-06-20 12:22:04,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=556104.0, ans=0.0 2023-06-20 12:22:05,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=556104.0, ans=0.2 2023-06-20 12:22:33,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.663e+02 2.998e+02 3.388e+02 6.421e+02, threshold=5.997e+02, percent-clipped=0.0 2023-06-20 12:22:54,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=556164.0, ans=0.1 2023-06-20 12:24:02,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=556344.0, ans=0.0 2023-06-20 12:24:02,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=556344.0, ans=0.125 2023-06-20 12:24:05,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=556344.0, ans=0.125 2023-06-20 12:24:08,860 INFO [train.py:996] (0/4) Epoch 4, batch 1250, loss[loss=0.2603, simple_loss=0.3332, pruned_loss=0.09372, over 21693.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.319, pruned_loss=0.08828, over 4288803.68 frames. ], batch size: 351, lr: 8.54e-03, grad_scale: 32.0 2023-06-20 12:24:36,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=556404.0, ans=0.0 2023-06-20 12:24:39,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=556464.0, ans=0.125 2023-06-20 12:25:03,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-20 12:26:17,531 INFO [train.py:996] (0/4) Epoch 4, batch 1300, loss[loss=0.2294, simple_loss=0.3121, pruned_loss=0.07334, over 21293.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3204, pruned_loss=0.08939, over 4289735.90 frames. 
], batch size: 176, lr: 8.54e-03, grad_scale: 32.0 2023-06-20 12:26:43,507 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.594e+02 2.933e+02 3.629e+02 6.395e+02, threshold=5.867e+02, percent-clipped=1.0 2023-06-20 12:27:57,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=556884.0, ans=0.0 2023-06-20 12:28:26,549 INFO [train.py:996] (0/4) Epoch 4, batch 1350, loss[loss=0.3064, simple_loss=0.3479, pruned_loss=0.1325, over 21792.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3207, pruned_loss=0.09021, over 4288220.27 frames. ], batch size: 507, lr: 8.54e-03, grad_scale: 32.0 2023-06-20 12:28:44,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-20 12:28:58,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=557064.0, ans=0.125 2023-06-20 12:30:31,699 INFO [train.py:996] (0/4) Epoch 4, batch 1400, loss[loss=0.2246, simple_loss=0.2785, pruned_loss=0.08531, over 15056.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3174, pruned_loss=0.08951, over 4288288.88 frames. ], batch size: 60, lr: 8.54e-03, grad_scale: 32.0 2023-06-20 12:30:58,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.685e+02 2.934e+02 3.988e+02 7.934e+02, threshold=5.867e+02, percent-clipped=7.0 2023-06-20 12:31:09,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=557364.0, ans=0.125 2023-06-20 12:31:38,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.25 vs. limit=15.0 2023-06-20 12:32:38,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=12.0 2023-06-20 12:32:45,063 INFO [train.py:996] (0/4) Epoch 4, batch 1450, loss[loss=0.2564, simple_loss=0.3211, pruned_loss=0.09587, over 21554.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3158, pruned_loss=0.08941, over 4291245.61 frames. ], batch size: 230, lr: 8.54e-03, grad_scale: 32.0 2023-06-20 12:33:05,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=557664.0, ans=0.0 2023-06-20 12:33:20,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=557664.0, ans=0.125 2023-06-20 12:34:11,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-06-20 12:34:18,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=557844.0, ans=0.0 2023-06-20 12:34:42,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=557844.0, ans=0.2 2023-06-20 12:34:53,735 INFO [train.py:996] (0/4) Epoch 4, batch 1500, loss[loss=0.2385, simple_loss=0.2972, pruned_loss=0.08992, over 21203.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3193, pruned_loss=0.0912, over 4291254.36 frames. 
], batch size: 608, lr: 8.53e-03, grad_scale: 32.0 2023-06-20 12:35:18,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 2.667e+02 3.053e+02 3.442e+02 4.744e+02, threshold=6.106e+02, percent-clipped=0.0 2023-06-20 12:36:08,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=558024.0, ans=0.125 2023-06-20 12:36:14,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=558084.0, ans=0.1 2023-06-20 12:37:17,107 INFO [train.py:996] (0/4) Epoch 4, batch 1550, loss[loss=0.1964, simple_loss=0.2872, pruned_loss=0.05279, over 21692.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3176, pruned_loss=0.09029, over 4285126.71 frames. ], batch size: 247, lr: 8.53e-03, grad_scale: 32.0 2023-06-20 12:38:07,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=558324.0, ans=0.0 2023-06-20 12:38:17,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=558384.0, ans=0.125 2023-06-20 12:38:30,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=558384.0, ans=0.2 2023-06-20 12:38:32,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=558384.0, ans=0.0 2023-06-20 12:39:13,417 INFO [train.py:996] (0/4) Epoch 4, batch 1600, loss[loss=0.2325, simple_loss=0.288, pruned_loss=0.08846, over 21078.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.317, pruned_loss=0.08869, over 4280463.12 frames. ], batch size: 143, lr: 8.53e-03, grad_scale: 32.0 2023-06-20 12:39:29,431 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.678e+02 3.179e+02 3.660e+02 5.340e+02, threshold=6.358e+02, percent-clipped=0.0 2023-06-20 12:40:34,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-20 12:41:19,039 INFO [train.py:996] (0/4) Epoch 4, batch 1650, loss[loss=0.2664, simple_loss=0.3321, pruned_loss=0.1004, over 20775.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3159, pruned_loss=0.08806, over 4278296.21 frames. ], batch size: 607, lr: 8.53e-03, grad_scale: 32.0 2023-06-20 12:41:36,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=558804.0, ans=0.125 2023-06-20 12:41:36,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=558804.0, ans=0.125 2023-06-20 12:42:25,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=558924.0, ans=0.04949747468305833 2023-06-20 12:43:09,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. 
limit=22.5 2023-06-20 12:43:20,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=559044.0, ans=0.125 2023-06-20 12:43:21,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=559044.0, ans=0.95 2023-06-20 12:43:24,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=559044.0, ans=0.125 2023-06-20 12:43:30,334 INFO [train.py:996] (0/4) Epoch 4, batch 1700, loss[loss=0.2456, simple_loss=0.304, pruned_loss=0.09364, over 21596.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3208, pruned_loss=0.09036, over 4282435.59 frames. ], batch size: 548, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:43:58,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.589e+02 2.821e+02 3.325e+02 4.601e+02, threshold=5.642e+02, percent-clipped=0.0 2023-06-20 12:43:59,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=559164.0, ans=0.125 2023-06-20 12:44:57,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=559284.0, ans=0.125 2023-06-20 12:45:18,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=559284.0, ans=0.125 2023-06-20 12:45:24,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=559344.0, ans=0.1 2023-06-20 12:45:33,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=559344.0, ans=0.2 2023-06-20 12:45:39,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=559344.0, ans=0.125 2023-06-20 12:45:49,022 INFO [train.py:996] (0/4) Epoch 4, batch 1750, loss[loss=0.1786, simple_loss=0.2651, pruned_loss=0.04604, over 21801.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3211, pruned_loss=0.08885, over 4279301.03 frames. ], batch size: 282, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:45:50,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=559404.0, ans=0.0 2023-06-20 12:46:15,367 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:47:00,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=559524.0, ans=0.1 2023-06-20 12:47:18,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=559524.0, ans=0.125 2023-06-20 12:48:18,423 INFO [train.py:996] (0/4) Epoch 4, batch 1800, loss[loss=0.1457, simple_loss=0.1943, pruned_loss=0.04862, over 21815.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3148, pruned_loss=0.08426, over 4283973.43 frames. 
], batch size: 102, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:48:30,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=559704.0, ans=0.125 2023-06-20 12:48:34,542 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.695e+02 3.353e+02 3.851e+02 6.055e+02, threshold=6.706e+02, percent-clipped=1.0 2023-06-20 12:50:07,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=559944.0, ans=0.125 2023-06-20 12:50:10,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=559944.0, ans=0.2 2023-06-20 12:50:15,943 INFO [train.py:996] (0/4) Epoch 4, batch 1850, loss[loss=0.2022, simple_loss=0.2903, pruned_loss=0.05707, over 21744.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.317, pruned_loss=0.08318, over 4285948.86 frames. ], batch size: 124, lr: 8.52e-03, grad_scale: 32.0 2023-06-20 12:50:22,291 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:50:51,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=560064.0, ans=0.0 2023-06-20 12:50:53,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0 2023-06-20 12:51:55,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=560184.0, ans=0.1 2023-06-20 12:52:18,570 INFO [train.py:996] (0/4) Epoch 4, batch 1900, loss[loss=0.2396, simple_loss=0.3175, pruned_loss=0.0808, over 21821.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3175, pruned_loss=0.08432, over 4289845.20 frames. ], batch size: 351, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 12:52:40,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.526e+02 2.932e+02 3.560e+02 6.916e+02, threshold=5.863e+02, percent-clipped=1.0 2023-06-20 12:52:42,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-20 12:52:43,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=560364.0, ans=0.1 2023-06-20 12:53:04,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=560364.0, ans=0.125 2023-06-20 12:53:39,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=560484.0, ans=0.2 2023-06-20 12:54:36,905 INFO [train.py:996] (0/4) Epoch 4, batch 1950, loss[loss=0.2117, simple_loss=0.2725, pruned_loss=0.07548, over 21821.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3148, pruned_loss=0.08404, over 4284731.71 frames. ], batch size: 125, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 12:55:19,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=560724.0, ans=0.125 2023-06-20 12:56:40,391 INFO [train.py:996] (0/4) Epoch 4, batch 2000, loss[loss=0.1629, simple_loss=0.2285, pruned_loss=0.04866, over 21341.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3084, pruned_loss=0.08168, over 4281404.77 frames. 
], batch size: 159, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 12:56:45,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=560904.0, ans=0.0 2023-06-20 12:56:48,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=560904.0, ans=0.125 2023-06-20 12:57:18,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.659e+02 3.074e+02 3.933e+02 7.372e+02, threshold=6.149e+02, percent-clipped=6.0 2023-06-20 12:58:37,539 INFO [train.py:996] (0/4) Epoch 4, batch 2050, loss[loss=0.2399, simple_loss=0.3136, pruned_loss=0.08313, over 21870.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3116, pruned_loss=0.08271, over 4281106.45 frames. ], batch size: 351, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 13:00:09,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=561384.0, ans=0.125 2023-06-20 13:00:46,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=561444.0, ans=0.1 2023-06-20 13:01:09,989 INFO [train.py:996] (0/4) Epoch 4, batch 2100, loss[loss=0.2854, simple_loss=0.3581, pruned_loss=0.1064, over 21289.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3163, pruned_loss=0.08577, over 4288524.38 frames. ], batch size: 159, lr: 8.51e-03, grad_scale: 32.0 2023-06-20 13:01:37,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.531e+02 2.915e+02 3.438e+02 5.326e+02, threshold=5.830e+02, percent-clipped=0.0 2023-06-20 13:02:07,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2023-06-20 13:02:08,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=561624.0, ans=0.2 2023-06-20 13:03:06,316 INFO [train.py:996] (0/4) Epoch 4, batch 2150, loss[loss=0.2555, simple_loss=0.3058, pruned_loss=0.1026, over 21863.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3158, pruned_loss=0.08676, over 4277900.27 frames. ], batch size: 107, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:04:21,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=561924.0, ans=0.0 2023-06-20 13:04:34,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=561984.0, ans=0.5 2023-06-20 13:04:52,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=561984.0, ans=0.2 2023-06-20 13:05:32,864 INFO [train.py:996] (0/4) Epoch 4, batch 2200, loss[loss=0.2404, simple_loss=0.3257, pruned_loss=0.07757, over 21743.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3182, pruned_loss=0.0858, over 4276815.30 frames. 
], batch size: 298, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:05:43,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=562104.0, ans=0.125 2023-06-20 13:05:45,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=562104.0, ans=0.0 2023-06-20 13:06:00,309 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.421e+02 2.845e+02 3.269e+02 4.462e+02, threshold=5.690e+02, percent-clipped=0.0 2023-06-20 13:06:01,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-20 13:06:14,250 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:06:48,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=562284.0, ans=0.125 2023-06-20 13:07:10,607 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:07:30,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=562344.0, ans=0.09899494936611666 2023-06-20 13:07:40,844 INFO [train.py:996] (0/4) Epoch 4, batch 2250, loss[loss=0.2111, simple_loss=0.269, pruned_loss=0.07661, over 21353.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3165, pruned_loss=0.08447, over 4278671.44 frames. ], batch size: 144, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:09:51,960 INFO [train.py:996] (0/4) Epoch 4, batch 2300, loss[loss=0.301, simple_loss=0.3939, pruned_loss=0.104, over 19749.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3136, pruned_loss=0.0851, over 4276890.31 frames. ], batch size: 702, lr: 8.50e-03, grad_scale: 32.0 2023-06-20 13:10:19,785 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.850e+02 3.293e+02 4.156e+02 7.467e+02, threshold=6.587e+02, percent-clipped=11.0 2023-06-20 13:11:14,596 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-20 13:11:41,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=562944.0, ans=0.125 2023-06-20 13:11:49,249 INFO [train.py:996] (0/4) Epoch 4, batch 2350, loss[loss=0.2559, simple_loss=0.3247, pruned_loss=0.09349, over 21676.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3109, pruned_loss=0.08604, over 4278131.25 frames. ], batch size: 332, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:12:11,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=563004.0, ans=0.2 2023-06-20 13:12:18,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=563004.0, ans=0.0 2023-06-20 13:12:26,454 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:12:40,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. 
limit=15.0 2023-06-20 13:12:41,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=563064.0, ans=0.125 2023-06-20 13:13:06,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=563184.0, ans=0.95 2023-06-20 13:13:24,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=563244.0, ans=0.125 2023-06-20 13:13:42,851 INFO [train.py:996] (0/4) Epoch 4, batch 2400, loss[loss=0.278, simple_loss=0.3419, pruned_loss=0.1071, over 21343.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3164, pruned_loss=0.08838, over 4264437.37 frames. ], batch size: 159, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:14:05,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.727e+02 3.231e+02 4.252e+02 7.805e+02, threshold=6.463e+02, percent-clipped=2.0 2023-06-20 13:14:06,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.89 vs. limit=15.0 2023-06-20 13:14:21,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=563364.0, ans=0.0 2023-06-20 13:15:27,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=563544.0, ans=0.04949747468305833 2023-06-20 13:15:50,103 INFO [train.py:996] (0/4) Epoch 4, batch 2450, loss[loss=0.251, simple_loss=0.3161, pruned_loss=0.09291, over 21778.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3219, pruned_loss=0.09138, over 4264371.20 frames. ], batch size: 124, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:16:18,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=563664.0, ans=0.2 2023-06-20 13:16:29,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=563664.0, ans=0.125 2023-06-20 13:17:01,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=15.0 2023-06-20 13:17:19,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.38 vs. limit=22.5 2023-06-20 13:17:40,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=22.5 2023-06-20 13:17:51,339 INFO [train.py:996] (0/4) Epoch 4, batch 2500, loss[loss=0.2406, simple_loss=0.3239, pruned_loss=0.0786, over 21640.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3179, pruned_loss=0.08963, over 4262751.19 frames. ], batch size: 247, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:18:25,929 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.483e+02 2.988e+02 3.628e+02 7.218e+02, threshold=5.976e+02, percent-clipped=3.0 2023-06-20 13:18:46,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=564024.0, ans=0.09899494936611666 2023-06-20 13:19:30,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.23 vs. 
limit=10.0 2023-06-20 13:19:53,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=564144.0, ans=0.0 2023-06-20 13:20:01,574 INFO [train.py:996] (0/4) Epoch 4, batch 2550, loss[loss=0.2758, simple_loss=0.3677, pruned_loss=0.0919, over 21294.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3156, pruned_loss=0.08758, over 4256319.86 frames. ], batch size: 548, lr: 8.49e-03, grad_scale: 32.0 2023-06-20 13:20:02,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-20 13:20:03,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=564204.0, ans=0.09899494936611666 2023-06-20 13:20:28,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=564264.0, ans=0.0 2023-06-20 13:21:16,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=564384.0, ans=0.125 2023-06-20 13:21:49,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=564444.0, ans=0.0 2023-06-20 13:21:50,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=564444.0, ans=0.04949747468305833 2023-06-20 13:21:53,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=564444.0, ans=0.2 2023-06-20 13:22:14,096 INFO [train.py:996] (0/4) Epoch 4, batch 2600, loss[loss=0.2619, simple_loss=0.3319, pruned_loss=0.09593, over 21805.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3174, pruned_loss=0.08936, over 4264599.83 frames. ], batch size: 118, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:22:35,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 2.603e+02 3.043e+02 3.665e+02 5.448e+02, threshold=6.087e+02, percent-clipped=0.0 2023-06-20 13:22:37,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=564564.0, ans=0.125 2023-06-20 13:23:49,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=564744.0, ans=0.0 2023-06-20 13:24:04,594 INFO [train.py:996] (0/4) Epoch 4, batch 2650, loss[loss=0.2414, simple_loss=0.3046, pruned_loss=0.08905, over 21842.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3159, pruned_loss=0.0899, over 4275197.86 frames. ], batch size: 332, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:26:08,581 INFO [train.py:996] (0/4) Epoch 4, batch 2700, loss[loss=0.1808, simple_loss=0.2458, pruned_loss=0.05784, over 21226.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3135, pruned_loss=0.0889, over 4256674.04 frames. ], batch size: 176, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:26:51,104 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.726e+02 3.108e+02 3.967e+02 7.517e+02, threshold=6.217e+02, percent-clipped=3.0 2023-06-20 13:27:47,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=565284.0, ans=0.125 2023-06-20 13:28:36,563 INFO [train.py:996] (0/4) Epoch 4, batch 2750, loss[loss=0.2854, simple_loss=0.4076, pruned_loss=0.0816, over 20828.00 frames. 
], tot_loss[loss=0.2467, simple_loss=0.3145, pruned_loss=0.08948, over 4251334.19 frames. ], batch size: 607, lr: 8.48e-03, grad_scale: 32.0 2023-06-20 13:29:27,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=565464.0, ans=0.2 2023-06-20 13:29:37,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=565524.0, ans=0.07 2023-06-20 13:31:01,576 INFO [train.py:996] (0/4) Epoch 4, batch 2800, loss[loss=0.284, simple_loss=0.3671, pruned_loss=0.1004, over 21785.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3202, pruned_loss=0.09167, over 4262382.79 frames. ], batch size: 332, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:31:14,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=565704.0, ans=0.2 2023-06-20 13:31:15,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=12.0 2023-06-20 13:31:24,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.619e+02 3.182e+02 3.699e+02 5.789e+02, threshold=6.364e+02, percent-clipped=0.0 2023-06-20 13:31:27,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=565764.0, ans=0.125 2023-06-20 13:31:32,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=565764.0, ans=0.0 2023-06-20 13:32:33,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=565884.0, ans=0.0 2023-06-20 13:32:46,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-20 13:32:47,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-06-20 13:33:03,262 INFO [train.py:996] (0/4) Epoch 4, batch 2850, loss[loss=0.2441, simple_loss=0.305, pruned_loss=0.09157, over 19971.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3187, pruned_loss=0.0916, over 4260378.32 frames. 
], batch size: 704, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:33:25,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=566064.0, ans=0.125 2023-06-20 13:33:27,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=566064.0, ans=0.0 2023-06-20 13:33:55,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=566124.0, ans=0.125 2023-06-20 13:34:34,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=566184.0, ans=0.0 2023-06-20 13:34:56,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=566244.0, ans=0.2 2023-06-20 13:35:05,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=566244.0, ans=15.0 2023-06-20 13:35:09,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=566244.0, ans=0.0 2023-06-20 13:35:20,758 INFO [train.py:996] (0/4) Epoch 4, batch 2900, loss[loss=0.3665, simple_loss=0.4374, pruned_loss=0.1478, over 21624.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.315, pruned_loss=0.09025, over 4260586.15 frames. ], batch size: 441, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:35:21,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=566304.0, ans=0.0 2023-06-20 13:35:43,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.738e+02 3.167e+02 3.892e+02 8.808e+02, threshold=6.333e+02, percent-clipped=7.0 2023-06-20 13:35:50,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0 2023-06-20 13:36:19,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=566424.0, ans=0.1 2023-06-20 13:37:03,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=566484.0, ans=0.125 2023-06-20 13:37:13,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-20 13:37:31,887 INFO [train.py:996] (0/4) Epoch 4, batch 2950, loss[loss=0.2433, simple_loss=0.2901, pruned_loss=0.09823, over 20200.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3179, pruned_loss=0.09083, over 4267383.34 frames. ], batch size: 703, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:37:53,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=566604.0, ans=0.1 2023-06-20 13:38:18,486 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.43 vs. 
limit=22.5 2023-06-20 13:38:20,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=566664.0, ans=0.0 2023-06-20 13:39:31,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=566784.0, ans=0.125 2023-06-20 13:39:55,844 INFO [train.py:996] (0/4) Epoch 4, batch 3000, loss[loss=0.2621, simple_loss=0.3359, pruned_loss=0.09415, over 21796.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3236, pruned_loss=0.09243, over 4270962.42 frames. ], batch size: 332, lr: 8.47e-03, grad_scale: 32.0 2023-06-20 13:39:55,846 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 13:40:43,512 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2581, simple_loss=0.352, pruned_loss=0.08208, over 1796401.00 frames. 2023-06-20 13:40:43,513 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-20 13:41:00,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.790e+02 3.201e+02 3.672e+02 6.689e+02, threshold=6.402e+02, percent-clipped=1.0 2023-06-20 13:41:13,982 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:41:13,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=566964.0, ans=0.1 2023-06-20 13:42:00,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.99 vs. limit=22.5 2023-06-20 13:42:39,630 INFO [train.py:996] (0/4) Epoch 4, batch 3050, loss[loss=0.1881, simple_loss=0.2793, pruned_loss=0.04848, over 21489.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.324, pruned_loss=0.09071, over 4278147.13 frames. ], batch size: 211, lr: 8.46e-03, grad_scale: 32.0 2023-06-20 13:43:32,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=567324.0, ans=0.125 2023-06-20 13:44:38,267 INFO [train.py:996] (0/4) Epoch 4, batch 3100, loss[loss=0.2246, simple_loss=0.3214, pruned_loss=0.06394, over 21681.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3234, pruned_loss=0.08995, over 4278298.85 frames. ], batch size: 298, lr: 8.46e-03, grad_scale: 32.0 2023-06-20 13:44:46,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=567504.0, ans=0.0 2023-06-20 13:44:57,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=567564.0, ans=0.0 2023-06-20 13:45:06,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.520e+02 2.947e+02 3.627e+02 6.337e+02, threshold=5.895e+02, percent-clipped=0.0 2023-06-20 13:46:27,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-20 13:46:34,680 INFO [train.py:996] (0/4) Epoch 4, batch 3150, loss[loss=0.2454, simple_loss=0.3191, pruned_loss=0.08584, over 21405.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3256, pruned_loss=0.09062, over 4276100.83 frames. 
], batch size: 211, lr: 8.46e-03, grad_scale: 32.0 2023-06-20 13:47:00,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=567864.0, ans=0.04949747468305833 2023-06-20 13:47:04,964 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:47:08,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=567864.0, ans=0.04949747468305833 2023-06-20 13:47:08,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=567864.0, ans=0.0 2023-06-20 13:47:13,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=567864.0, ans=0.015 2023-06-20 13:47:13,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=567864.0, ans=0.125 2023-06-20 13:47:20,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=567924.0, ans=0.05 2023-06-20 13:47:52,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=567984.0, ans=0.1 2023-06-20 13:48:13,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=567984.0, ans=0.0 2023-06-20 13:48:22,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=568044.0, ans=0.125 2023-06-20 13:48:40,621 INFO [train.py:996] (0/4) Epoch 4, batch 3200, loss[loss=0.2426, simple_loss=0.3257, pruned_loss=0.07978, over 21800.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3266, pruned_loss=0.09038, over 4279923.90 frames. ], batch size: 332, lr: 8.46e-03, grad_scale: 32.0 2023-06-20 13:49:01,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=568164.0, ans=0.2 2023-06-20 13:49:09,864 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.444e+02 2.825e+02 3.335e+02 7.265e+02, threshold=5.650e+02, percent-clipped=2.0 2023-06-20 13:49:27,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=568164.0, ans=0.0 2023-06-20 13:49:49,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=568224.0, ans=0.015 2023-06-20 13:50:20,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=568284.0, ans=0.1 2023-06-20 13:50:22,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=568284.0, ans=0.2 2023-06-20 13:50:37,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-20 13:50:45,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=568344.0, ans=0.0 2023-06-20 13:50:50,903 INFO [train.py:996] (0/4) Epoch 4, batch 3250, loss[loss=0.2739, simple_loss=0.3191, pruned_loss=0.1143, over 21539.00 frames. 
], tot_loss[loss=0.2538, simple_loss=0.3244, pruned_loss=0.09162, over 4279803.20 frames. ], batch size: 441, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:51:04,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=568404.0, ans=0.125 2023-06-20 13:51:15,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=568464.0, ans=0.0 2023-06-20 13:51:22,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=568464.0, ans=0.125 2023-06-20 13:51:33,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=568524.0, ans=0.125 2023-06-20 13:51:41,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=568524.0, ans=0.1 2023-06-20 13:51:46,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=568524.0, ans=0.125 2023-06-20 13:51:52,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=568584.0, ans=0.1 2023-06-20 13:52:02,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=568644.0, ans=0.125 2023-06-20 13:52:24,887 INFO [train.py:996] (0/4) Epoch 4, batch 3300, loss[loss=0.2191, simple_loss=0.2839, pruned_loss=0.07713, over 21896.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3189, pruned_loss=0.09167, over 4276920.68 frames. ], batch size: 125, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:52:34,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=568704.0, ans=0.04949747468305833 2023-06-20 13:52:41,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=568704.0, ans=0.2 2023-06-20 13:52:51,753 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.805e+02 3.280e+02 3.867e+02 6.254e+02, threshold=6.560e+02, percent-clipped=3.0 2023-06-20 13:53:30,705 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. limit=10.0 2023-06-20 13:54:16,870 INFO [train.py:996] (0/4) Epoch 4, batch 3350, loss[loss=0.262, simple_loss=0.3339, pruned_loss=0.09502, over 21779.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3221, pruned_loss=0.09164, over 4282350.38 frames. 
], batch size: 124, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:54:20,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=569004.0, ans=0.125 2023-06-20 13:54:25,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=569004.0, ans=10.0 2023-06-20 13:54:48,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=569064.0, ans=0.125 2023-06-20 13:54:59,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=569124.0, ans=0.125 2023-06-20 13:55:09,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=569184.0, ans=0.04949747468305833 2023-06-20 13:55:09,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-20 13:55:11,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-20 13:55:28,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=569184.0, ans=0.1 2023-06-20 13:56:09,593 INFO [train.py:996] (0/4) Epoch 4, batch 3400, loss[loss=0.2359, simple_loss=0.3012, pruned_loss=0.08526, over 21828.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.322, pruned_loss=0.09214, over 4284454.15 frames. ], batch size: 107, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:56:35,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.736e+02 3.073e+02 3.574e+02 7.893e+02, threshold=6.146e+02, percent-clipped=1.0 2023-06-20 13:56:42,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=15.0 2023-06-20 13:56:43,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=22.5 2023-06-20 13:56:46,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=569424.0, ans=0.2 2023-06-20 13:57:29,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=569484.0, ans=0.015 2023-06-20 13:57:32,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=569544.0, ans=0.125 2023-06-20 13:57:34,007 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:57:34,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=569544.0, ans=0.125 2023-06-20 13:57:53,504 INFO [train.py:996] (0/4) Epoch 4, batch 3450, loss[loss=0.2166, simple_loss=0.2848, pruned_loss=0.07426, over 21558.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3173, pruned_loss=0.09074, over 4280517.29 frames. 
], batch size: 263, lr: 8.45e-03, grad_scale: 16.0 2023-06-20 13:58:16,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=569664.0, ans=0.2 2023-06-20 13:58:30,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-20 13:59:51,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-20 13:59:52,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=569844.0, ans=0.125 2023-06-20 14:00:02,853 INFO [train.py:996] (0/4) Epoch 4, batch 3500, loss[loss=0.2397, simple_loss=0.2976, pruned_loss=0.09087, over 21220.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3263, pruned_loss=0.0939, over 4284354.89 frames. ], batch size: 608, lr: 8.44e-03, grad_scale: 16.0 2023-06-20 14:00:20,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=569904.0, ans=0.125 2023-06-20 14:00:31,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=569964.0, ans=0.125 2023-06-20 14:00:34,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.895e+02 3.377e+02 4.214e+02 8.364e+02, threshold=6.755e+02, percent-clipped=8.0 2023-06-20 14:00:49,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=570024.0, ans=0.0 2023-06-20 14:01:30,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-20 14:02:02,771 INFO [train.py:996] (0/4) Epoch 4, batch 3550, loss[loss=0.2401, simple_loss=0.3141, pruned_loss=0.08306, over 21635.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3298, pruned_loss=0.0948, over 4278564.56 frames. ], batch size: 332, lr: 8.44e-03, grad_scale: 16.0 2023-06-20 14:03:31,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=570384.0, ans=0.5 2023-06-20 14:03:59,266 INFO [train.py:996] (0/4) Epoch 4, batch 3600, loss[loss=0.2257, simple_loss=0.2988, pruned_loss=0.0763, over 20050.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3236, pruned_loss=0.09397, over 4279077.89 frames. ], batch size: 702, lr: 8.44e-03, grad_scale: 32.0 2023-06-20 14:04:00,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-20 14:04:11,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=570504.0, ans=0.09899494936611666 2023-06-20 14:04:25,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.783e+02 3.214e+02 3.773e+02 6.590e+02, threshold=6.428e+02, percent-clipped=0.0 2023-06-20 14:05:14,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=570684.0, ans=0.1 2023-06-20 14:05:55,921 INFO [train.py:996] (0/4) Epoch 4, batch 3650, loss[loss=0.2235, simple_loss=0.2941, pruned_loss=0.07641, over 21633.00 frames. 
], tot_loss[loss=0.2569, simple_loss=0.3255, pruned_loss=0.09418, over 4280836.08 frames. ], batch size: 230, lr: 8.44e-03, grad_scale: 16.0 2023-06-20 14:06:23,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=570864.0, ans=0.0 2023-06-20 14:06:50,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=570924.0, ans=0.95 2023-06-20 14:07:27,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=571044.0, ans=0.125 2023-06-20 14:08:03,202 INFO [train.py:996] (0/4) Epoch 4, batch 3700, loss[loss=0.2463, simple_loss=0.3157, pruned_loss=0.08847, over 21884.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3229, pruned_loss=0.09287, over 4282463.17 frames. ], batch size: 371, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:08:41,029 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.519e+02 3.000e+02 3.598e+02 6.843e+02, threshold=6.000e+02, percent-clipped=1.0 2023-06-20 14:10:13,611 INFO [train.py:996] (0/4) Epoch 4, batch 3750, loss[loss=0.1944, simple_loss=0.2696, pruned_loss=0.0596, over 21403.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3212, pruned_loss=0.09222, over 4280296.00 frames. ], batch size: 194, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:11:08,805 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:11:20,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=571584.0, ans=0.2 2023-06-20 14:11:38,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-20 14:12:13,304 INFO [train.py:996] (0/4) Epoch 4, batch 3800, loss[loss=0.2155, simple_loss=0.3246, pruned_loss=0.05324, over 20786.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3195, pruned_loss=0.09026, over 4276620.79 frames. ], batch size: 608, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:12:14,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=22.5 2023-06-20 14:12:45,348 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.431e+02 2.892e+02 3.588e+02 7.450e+02, threshold=5.785e+02, percent-clipped=3.0 2023-06-20 14:13:03,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2023-06-20 14:13:23,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=571884.0, ans=0.0 2023-06-20 14:13:50,974 INFO [train.py:996] (0/4) Epoch 4, batch 3850, loss[loss=0.2101, simple_loss=0.2751, pruned_loss=0.07258, over 21719.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3166, pruned_loss=0.0901, over 4275813.82 frames. ], batch size: 112, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:14:48,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.94 vs. 
limit=12.0 2023-06-20 14:15:15,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=572184.0, ans=0.0 2023-06-20 14:15:31,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=572244.0, ans=0.2 2023-06-20 14:15:38,373 INFO [train.py:996] (0/4) Epoch 4, batch 3900, loss[loss=0.2302, simple_loss=0.2965, pruned_loss=0.08189, over 21853.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3124, pruned_loss=0.08971, over 4266568.63 frames. ], batch size: 107, lr: 8.43e-03, grad_scale: 16.0 2023-06-20 14:16:08,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=572304.0, ans=0.0 2023-06-20 14:16:17,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.553e+02 2.993e+02 3.640e+02 7.151e+02, threshold=5.987e+02, percent-clipped=1.0 2023-06-20 14:16:44,909 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:17:39,778 INFO [train.py:996] (0/4) Epoch 4, batch 3950, loss[loss=0.2011, simple_loss=0.3145, pruned_loss=0.04386, over 19796.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3132, pruned_loss=0.08826, over 4268760.00 frames. ], batch size: 702, lr: 8.42e-03, grad_scale: 16.0 2023-06-20 14:19:47,307 INFO [train.py:996] (0/4) Epoch 4, batch 4000, loss[loss=0.1895, simple_loss=0.2506, pruned_loss=0.06418, over 21298.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3066, pruned_loss=0.08484, over 4274465.79 frames. ], batch size: 177, lr: 8.42e-03, grad_scale: 32.0 2023-06-20 14:20:02,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=572964.0, ans=0.2 2023-06-20 14:20:13,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.382e+02 2.703e+02 3.231e+02 4.558e+02, threshold=5.407e+02, percent-clipped=0.0 2023-06-20 14:20:15,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-06-20 14:20:23,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=573024.0, ans=0.125 2023-06-20 14:20:23,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=573024.0, ans=0.0 2023-06-20 14:20:56,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=573144.0, ans=0.2 2023-06-20 14:21:28,561 INFO [train.py:996] (0/4) Epoch 4, batch 4050, loss[loss=0.2179, simple_loss=0.2895, pruned_loss=0.07314, over 21147.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3059, pruned_loss=0.08291, over 4270541.92 frames. ], batch size: 143, lr: 8.42e-03, grad_scale: 32.0 2023-06-20 14:22:21,765 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.16 vs. 
limit=15.0 2023-06-20 14:22:23,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=573324.0, ans=0.125 2023-06-20 14:22:34,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=573324.0, ans=0.0 2023-06-20 14:22:37,753 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-20 14:23:32,508 INFO [train.py:996] (0/4) Epoch 4, batch 4100, loss[loss=0.2636, simple_loss=0.3337, pruned_loss=0.09679, over 21338.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3095, pruned_loss=0.08498, over 4279099.29 frames. ], batch size: 548, lr: 8.42e-03, grad_scale: 32.0 2023-06-20 14:24:14,723 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.729e+02 3.089e+02 3.698e+02 6.441e+02, threshold=6.178e+02, percent-clipped=7.0 2023-06-20 14:24:40,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=573624.0, ans=0.035 2023-06-20 14:25:28,424 INFO [train.py:996] (0/4) Epoch 4, batch 4150, loss[loss=0.2232, simple_loss=0.275, pruned_loss=0.08566, over 21263.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3107, pruned_loss=0.08313, over 4273203.51 frames. ], batch size: 548, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:26:12,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-20 14:26:38,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=12.0 2023-06-20 14:27:29,743 INFO [train.py:996] (0/4) Epoch 4, batch 4200, loss[loss=0.2369, simple_loss=0.3047, pruned_loss=0.08455, over 21629.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3105, pruned_loss=0.08325, over 4262938.93 frames. ], batch size: 247, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:27:42,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.57 vs. limit=15.0 2023-06-20 14:27:45,405 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:27:45,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=574164.0, ans=0.125 2023-06-20 14:27:56,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 2.492e+02 2.868e+02 3.338e+02 5.959e+02, threshold=5.736e+02, percent-clipped=0.0 2023-06-20 14:28:05,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=574224.0, ans=0.125 2023-06-20 14:28:15,877 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=12.0 2023-06-20 14:29:06,491 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. 
limit=15.0 2023-06-20 14:29:22,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=574404.0, ans=0.1 2023-06-20 14:29:23,354 INFO [train.py:996] (0/4) Epoch 4, batch 4250, loss[loss=0.2294, simple_loss=0.3044, pruned_loss=0.07715, over 19983.00 frames. ], tot_loss[loss=0.245, simple_loss=0.319, pruned_loss=0.08552, over 4264940.48 frames. ], batch size: 702, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:29:31,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=574404.0, ans=0.125 2023-06-20 14:29:32,185 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=12.0 2023-06-20 14:29:48,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-20 14:31:00,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=574584.0, ans=0.5 2023-06-20 14:31:03,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=574584.0, ans=0.125 2023-06-20 14:31:33,584 INFO [train.py:996] (0/4) Epoch 4, batch 4300, loss[loss=0.3284, simple_loss=0.4097, pruned_loss=0.1236, over 21414.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3255, pruned_loss=0.08845, over 4266136.33 frames. ], batch size: 507, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:32:18,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.663e+02 3.157e+02 3.844e+02 7.898e+02, threshold=6.314e+02, percent-clipped=3.0 2023-06-20 14:32:35,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=574824.0, ans=0.04949747468305833 2023-06-20 14:33:24,419 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:33:35,740 INFO [train.py:996] (0/4) Epoch 4, batch 4350, loss[loss=0.2566, simple_loss=0.3128, pruned_loss=0.1002, over 21449.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.325, pruned_loss=0.08814, over 4267809.36 frames. ], batch size: 389, lr: 8.41e-03, grad_scale: 32.0 2023-06-20 14:33:44,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.62 vs. limit=15.0 2023-06-20 14:34:01,228 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:34:10,625 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:34:11,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-20 14:35:09,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=575184.0, ans=0.5 2023-06-20 14:35:26,730 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. 
limit=15.0 2023-06-20 14:35:27,076 INFO [train.py:996] (0/4) Epoch 4, batch 4400, loss[loss=0.288, simple_loss=0.3584, pruned_loss=0.1088, over 21370.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3225, pruned_loss=0.08789, over 4263560.01 frames. ], batch size: 549, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:35:33,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=575304.0, ans=0.0 2023-06-20 14:36:06,024 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.625e+02 3.087e+02 3.539e+02 6.162e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-20 14:37:08,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=575544.0, ans=0.0 2023-06-20 14:37:32,985 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-20 14:37:37,827 INFO [train.py:996] (0/4) Epoch 4, batch 4450, loss[loss=0.2331, simple_loss=0.3285, pruned_loss=0.06884, over 21581.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.329, pruned_loss=0.08876, over 4270367.96 frames. ], batch size: 230, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:37:38,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=575604.0, ans=0.125 2023-06-20 14:38:04,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=575664.0, ans=0.0 2023-06-20 14:38:30,990 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:39:35,194 INFO [train.py:996] (0/4) Epoch 4, batch 4500, loss[loss=0.2481, simple_loss=0.3278, pruned_loss=0.08423, over 21235.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3303, pruned_loss=0.09023, over 4272406.84 frames. ], batch size: 159, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:39:37,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=575904.0, ans=0.125 2023-06-20 14:39:59,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=575904.0, ans=0.0 2023-06-20 14:40:07,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.623e+02 2.937e+02 3.558e+02 5.301e+02, threshold=5.874e+02, percent-clipped=0.0 2023-06-20 14:40:14,813 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-96000.pt 2023-06-20 14:40:55,589 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-06-20 14:41:22,673 INFO [train.py:996] (0/4) Epoch 4, batch 4550, loss[loss=0.319, simple_loss=0.3796, pruned_loss=0.1292, over 21402.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3337, pruned_loss=0.09078, over 4275344.90 frames. ], batch size: 471, lr: 8.40e-03, grad_scale: 32.0 2023-06-20 14:42:33,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.14 vs. 
limit=15.0 2023-06-20 14:42:50,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=576384.0, ans=22.5 2023-06-20 14:43:23,272 INFO [train.py:996] (0/4) Epoch 4, batch 4600, loss[loss=0.23, simple_loss=0.2815, pruned_loss=0.08924, over 21165.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3344, pruned_loss=0.09288, over 4276434.86 frames. ], batch size: 608, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:43:49,987 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.820e+02 3.441e+02 4.090e+02 6.361e+02, threshold=6.882e+02, percent-clipped=1.0 2023-06-20 14:44:57,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=576744.0, ans=0.0 2023-06-20 14:44:58,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=576744.0, ans=0.125 2023-06-20 14:45:01,218 INFO [train.py:996] (0/4) Epoch 4, batch 4650, loss[loss=0.2277, simple_loss=0.3033, pruned_loss=0.07605, over 21866.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3271, pruned_loss=0.09097, over 4282695.00 frames. ], batch size: 332, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:45:01,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=576804.0, ans=0.125 2023-06-20 14:45:34,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=576864.0, ans=0.07 2023-06-20 14:45:41,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=576924.0, ans=10.0 2023-06-20 14:45:46,506 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:46:06,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=576984.0, ans=0.0 2023-06-20 14:46:20,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=577044.0, ans=0.125 2023-06-20 14:46:30,086 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:46:33,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=8.0 2023-06-20 14:46:36,960 INFO [train.py:996] (0/4) Epoch 4, batch 4700, loss[loss=0.2293, simple_loss=0.2949, pruned_loss=0.08181, over 21860.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3176, pruned_loss=0.08728, over 4278977.22 frames. ], batch size: 107, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:47:09,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.385e+02 2.859e+02 3.485e+02 5.797e+02, threshold=5.718e+02, percent-clipped=0.0 2023-06-20 14:47:33,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=577284.0, ans=0.0 2023-06-20 14:47:54,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=577344.0, ans=0.0 2023-06-20 14:48:14,347 INFO [train.py:996] (0/4) Epoch 4, batch 4750, loss[loss=0.2488, simple_loss=0.3015, pruned_loss=0.09806, over 21566.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3121, pruned_loss=0.08771, over 4274447.70 frames. 
], batch size: 548, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:48:14,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=577404.0, ans=0.125 2023-06-20 14:48:31,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=577404.0, ans=0.0 2023-06-20 14:48:51,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-20 14:48:58,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=577524.0, ans=0.125 2023-06-20 14:49:04,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=577524.0, ans=0.0 2023-06-20 14:49:06,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=577524.0, ans=0.125 2023-06-20 14:49:30,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=577584.0, ans=0.07 2023-06-20 14:50:08,455 INFO [train.py:996] (0/4) Epoch 4, batch 4800, loss[loss=0.3076, simple_loss=0.3957, pruned_loss=0.1098, over 21527.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3128, pruned_loss=0.08821, over 4283568.84 frames. ], batch size: 471, lr: 8.39e-03, grad_scale: 32.0 2023-06-20 14:50:45,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.623e+02 2.992e+02 3.457e+02 8.061e+02, threshold=5.984e+02, percent-clipped=2.0 2023-06-20 14:50:51,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=577764.0, ans=0.0 2023-06-20 14:50:52,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-20 14:51:18,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=577824.0, ans=0.125 2023-06-20 14:51:48,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=577944.0, ans=0.125 2023-06-20 14:52:02,775 INFO [train.py:996] (0/4) Epoch 4, batch 4850, loss[loss=0.2325, simple_loss=0.3009, pruned_loss=0.08211, over 21636.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3111, pruned_loss=0.08739, over 4279922.77 frames. ], batch size: 230, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:52:16,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=578004.0, ans=0.0 2023-06-20 14:52:34,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=12.0 2023-06-20 14:52:46,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=578124.0, ans=0.125 2023-06-20 14:53:40,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=578304.0, ans=0.1 2023-06-20 14:53:41,183 INFO [train.py:996] (0/4) Epoch 4, batch 4900, loss[loss=0.2812, simple_loss=0.3567, pruned_loss=0.1028, over 21744.00 frames. 
], tot_loss[loss=0.2458, simple_loss=0.3136, pruned_loss=0.089, over 4289665.46 frames. ], batch size: 441, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:53:53,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=578304.0, ans=0.2 2023-06-20 14:54:13,435 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.490e+02 2.860e+02 3.460e+02 5.530e+02, threshold=5.721e+02, percent-clipped=0.0 2023-06-20 14:54:18,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=578364.0, ans=0.0 2023-06-20 14:54:22,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=10.0 2023-06-20 14:54:26,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=578424.0, ans=0.125 2023-06-20 14:55:08,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=578544.0, ans=0.125 2023-06-20 14:55:16,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-06-20 14:55:27,080 INFO [train.py:996] (0/4) Epoch 4, batch 4950, loss[loss=0.2, simple_loss=0.2729, pruned_loss=0.06353, over 21293.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3173, pruned_loss=0.08735, over 4284883.71 frames. ], batch size: 144, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:55:32,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=578604.0, ans=0.05 2023-06-20 14:55:51,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=578664.0, ans=0.125 2023-06-20 14:56:19,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=578724.0, ans=0.125 2023-06-20 14:56:26,329 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=15.0 2023-06-20 14:57:04,932 INFO [train.py:996] (0/4) Epoch 4, batch 5000, loss[loss=0.2626, simple_loss=0.3334, pruned_loss=0.09591, over 21855.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3171, pruned_loss=0.08328, over 4286967.10 frames. ], batch size: 414, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:57:05,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=578904.0, ans=0.04949747468305833 2023-06-20 14:57:27,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=578964.0, ans=0.2 2023-06-20 14:57:31,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.437e+02 2.852e+02 3.653e+02 5.289e+02, threshold=5.704e+02, percent-clipped=0.0 2023-06-20 14:58:01,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=579084.0, ans=0.125 2023-06-20 14:58:36,501 INFO [train.py:996] (0/4) Epoch 4, batch 5050, loss[loss=0.2251, simple_loss=0.2947, pruned_loss=0.07776, over 15254.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3184, pruned_loss=0.08483, over 4284052.68 frames. 
], batch size: 61, lr: 8.38e-03, grad_scale: 32.0 2023-06-20 14:58:36,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=579204.0, ans=0.125 2023-06-20 14:59:19,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=579324.0, ans=0.125 2023-06-20 14:59:49,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=579384.0, ans=0.125 2023-06-20 15:00:18,919 INFO [train.py:996] (0/4) Epoch 4, batch 5100, loss[loss=0.2221, simple_loss=0.2911, pruned_loss=0.07659, over 21778.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3172, pruned_loss=0.08516, over 4291416.45 frames. ], batch size: 247, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:00:22,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=579504.0, ans=0.2 2023-06-20 15:00:28,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=579504.0, ans=0.125 2023-06-20 15:00:43,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2023-06-20 15:00:45,324 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.56 vs. limit=22.5 2023-06-20 15:00:45,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.522e+02 2.846e+02 3.197e+02 5.068e+02, threshold=5.691e+02, percent-clipped=0.0 2023-06-20 15:01:15,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=579684.0, ans=0.0 2023-06-20 15:01:37,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=579744.0, ans=15.0 2023-06-20 15:01:57,142 INFO [train.py:996] (0/4) Epoch 4, batch 5150, loss[loss=0.2258, simple_loss=0.2992, pruned_loss=0.07619, over 21837.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3149, pruned_loss=0.08556, over 4296450.84 frames. ], batch size: 298, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:02:01,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-20 15:02:25,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-20 15:02:32,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=579924.0, ans=0.125 2023-06-20 15:02:59,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=579984.0, ans=0.1 2023-06-20 15:03:12,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-20 15:03:35,258 INFO [train.py:996] (0/4) Epoch 4, batch 5200, loss[loss=0.2507, simple_loss=0.3452, pruned_loss=0.0781, over 21776.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3147, pruned_loss=0.08601, over 4296347.89 frames. 
], batch size: 351, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:04:01,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.678e+02 3.286e+02 3.852e+02 6.243e+02, threshold=6.571e+02, percent-clipped=1.0 2023-06-20 15:05:21,824 INFO [train.py:996] (0/4) Epoch 4, batch 5250, loss[loss=0.2542, simple_loss=0.3319, pruned_loss=0.08829, over 21833.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3192, pruned_loss=0.08497, over 4297307.48 frames. ], batch size: 316, lr: 8.37e-03, grad_scale: 32.0 2023-06-20 15:05:44,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.16 vs. limit=15.0 2023-06-20 15:05:47,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=580464.0, ans=0.125 2023-06-20 15:05:53,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=580464.0, ans=0.2 2023-06-20 15:06:01,325 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-20 15:06:16,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=580584.0, ans=0.1 2023-06-20 15:06:16,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=580584.0, ans=0.2 2023-06-20 15:06:48,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=580644.0, ans=0.125 2023-06-20 15:06:48,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=580644.0, ans=0.125 2023-06-20 15:06:59,111 INFO [train.py:996] (0/4) Epoch 4, batch 5300, loss[loss=0.2814, simple_loss=0.4051, pruned_loss=0.07888, over 19806.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3191, pruned_loss=0.08534, over 4299922.24 frames. ], batch size: 702, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:07:05,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=580704.0, ans=0.125 2023-06-20 15:07:19,391 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-20 15:07:24,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=580764.0, ans=0.015 2023-06-20 15:07:25,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.445e+02 2.866e+02 3.418e+02 4.898e+02, threshold=5.732e+02, percent-clipped=0.0 2023-06-20 15:07:26,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=580764.0, ans=0.125 2023-06-20 15:07:51,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=580824.0, ans=0.05 2023-06-20 15:08:35,819 INFO [train.py:996] (0/4) Epoch 4, batch 5350, loss[loss=0.2373, simple_loss=0.2981, pruned_loss=0.08823, over 21690.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3195, pruned_loss=0.08765, over 4304507.97 frames. 
], batch size: 230, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:09:25,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=581124.0, ans=0.2 2023-06-20 15:10:13,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=581244.0, ans=0.5 2023-06-20 15:10:26,849 INFO [train.py:996] (0/4) Epoch 4, batch 5400, loss[loss=0.2157, simple_loss=0.2939, pruned_loss=0.06875, over 21791.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3172, pruned_loss=0.08874, over 4312749.98 frames. ], batch size: 298, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:10:59,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 2.501e+02 3.019e+02 3.777e+02 8.074e+02, threshold=6.038e+02, percent-clipped=3.0 2023-06-20 15:11:02,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=581364.0, ans=0.2 2023-06-20 15:11:46,453 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:12:18,265 INFO [train.py:996] (0/4) Epoch 4, batch 5450, loss[loss=0.2702, simple_loss=0.3802, pruned_loss=0.08008, over 21779.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.315, pruned_loss=0.08617, over 4310275.93 frames. ], batch size: 351, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:12:34,676 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:12:55,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=581664.0, ans=0.2 2023-06-20 15:13:00,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=581724.0, ans=0.125 2023-06-20 15:13:36,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-20 15:13:56,314 INFO [train.py:996] (0/4) Epoch 4, batch 5500, loss[loss=0.197, simple_loss=0.2727, pruned_loss=0.06064, over 21895.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3191, pruned_loss=0.08307, over 4305838.77 frames. ], batch size: 98, lr: 8.36e-03, grad_scale: 32.0 2023-06-20 15:14:28,840 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 2.356e+02 2.658e+02 3.357e+02 7.374e+02, threshold=5.315e+02, percent-clipped=2.0 2023-06-20 15:14:52,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=582024.0, ans=0.95 2023-06-20 15:15:01,159 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-20 15:15:21,566 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-20 15:15:41,308 INFO [train.py:996] (0/4) Epoch 4, batch 5550, loss[loss=0.2149, simple_loss=0.3091, pruned_loss=0.06036, over 21687.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3182, pruned_loss=0.07962, over 4296388.27 frames. 
], batch size: 263, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:16:29,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=582324.0, ans=0.125 2023-06-20 15:16:29,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=582324.0, ans=0.125 2023-06-20 15:16:41,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=582324.0, ans=0.125 2023-06-20 15:17:26,386 INFO [train.py:996] (0/4) Epoch 4, batch 5600, loss[loss=0.3066, simple_loss=0.4224, pruned_loss=0.09544, over 19811.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3174, pruned_loss=0.07745, over 4292956.49 frames. ], batch size: 702, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:17:44,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=582504.0, ans=0.0 2023-06-20 15:17:54,169 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 2.189e+02 2.611e+02 3.161e+02 5.286e+02, threshold=5.221e+02, percent-clipped=0.0 2023-06-20 15:18:02,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=582624.0, ans=0.125 2023-06-20 15:18:05,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=582624.0, ans=0.0 2023-06-20 15:18:17,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=582624.0, ans=0.125 2023-06-20 15:19:00,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=582744.0, ans=0.0 2023-06-20 15:19:02,455 INFO [train.py:996] (0/4) Epoch 4, batch 5650, loss[loss=0.2674, simple_loss=0.3324, pruned_loss=0.1012, over 21754.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3214, pruned_loss=0.07996, over 4294971.09 frames. ], batch size: 441, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:19:32,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=582864.0, ans=0.2 2023-06-20 15:19:33,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=582864.0, ans=0.125 2023-06-20 15:19:47,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=582924.0, ans=0.2 2023-06-20 15:20:04,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-20 15:20:06,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=582984.0, ans=0.04949747468305833 2023-06-20 15:20:44,352 INFO [train.py:996] (0/4) Epoch 4, batch 5700, loss[loss=0.2176, simple_loss=0.3026, pruned_loss=0.0663, over 21652.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3203, pruned_loss=0.08179, over 4285802.30 frames. 
], batch size: 263, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:20:44,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=583104.0, ans=0.0 2023-06-20 15:21:12,326 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.371e+02 2.953e+02 3.348e+02 5.178e+02, threshold=5.907e+02, percent-clipped=0.0 2023-06-20 15:21:13,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=583164.0, ans=0.0 2023-06-20 15:21:15,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=583164.0, ans=0.2 2023-06-20 15:21:40,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2023-06-20 15:22:11,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=583284.0, ans=0.125 2023-06-20 15:22:33,077 INFO [train.py:996] (0/4) Epoch 4, batch 5750, loss[loss=0.2674, simple_loss=0.3192, pruned_loss=0.1079, over 20040.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3145, pruned_loss=0.07811, over 4285116.74 frames. ], batch size: 702, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 15:23:12,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=583464.0, ans=0.125 2023-06-20 15:23:20,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-20 15:23:37,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=583584.0, ans=0.125 2023-06-20 15:24:12,029 INFO [train.py:996] (0/4) Epoch 4, batch 5800, loss[loss=0.2274, simple_loss=0.3022, pruned_loss=0.07627, over 21179.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3137, pruned_loss=0.07673, over 4281788.97 frames. ], batch size: 143, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:24:46,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.366e+02 2.813e+02 3.634e+02 6.586e+02, threshold=5.626e+02, percent-clipped=4.0 2023-06-20 15:24:52,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=583824.0, ans=0.125 2023-06-20 15:25:41,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=583944.0, ans=0.04949747468305833 2023-06-20 15:26:02,443 INFO [train.py:996] (0/4) Epoch 4, batch 5850, loss[loss=0.1762, simple_loss=0.275, pruned_loss=0.03872, over 21287.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.309, pruned_loss=0.07226, over 4284121.54 frames. ], batch size: 176, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:26:42,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=584124.0, ans=0.125 2023-06-20 15:26:58,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=584184.0, ans=0.125 2023-06-20 15:26:59,197 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.78 vs. 
limit=12.0 2023-06-20 15:27:01,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=584184.0, ans=0.1 2023-06-20 15:27:05,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584184.0, ans=0.1 2023-06-20 15:27:12,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=584184.0, ans=0.125 2023-06-20 15:27:39,552 INFO [train.py:996] (0/4) Epoch 4, batch 5900, loss[loss=0.1979, simple_loss=0.273, pruned_loss=0.06135, over 21841.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3028, pruned_loss=0.06834, over 4286111.54 frames. ], batch size: 282, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:27:45,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=584304.0, ans=0.125 2023-06-20 15:27:57,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=584364.0, ans=0.0 2023-06-20 15:28:10,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 2.239e+02 2.663e+02 3.390e+02 4.720e+02, threshold=5.325e+02, percent-clipped=0.0 2023-06-20 15:29:22,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=584544.0, ans=0.2 2023-06-20 15:29:26,769 INFO [train.py:996] (0/4) Epoch 4, batch 5950, loss[loss=0.1955, simple_loss=0.2751, pruned_loss=0.05793, over 21649.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3031, pruned_loss=0.07137, over 4276928.66 frames. ], batch size: 230, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 15:30:08,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584724.0, ans=0.1 2023-06-20 15:30:14,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=584724.0, ans=0.125 2023-06-20 15:30:50,726 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.10 vs. limit=15.0 2023-06-20 15:31:00,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=584844.0, ans=0.125 2023-06-20 15:31:13,852 INFO [train.py:996] (0/4) Epoch 4, batch 6000, loss[loss=0.2517, simple_loss=0.2923, pruned_loss=0.1055, over 21320.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3012, pruned_loss=0.07511, over 4274620.09 frames. ], batch size: 473, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 15:31:13,854 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 15:32:15,028 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.1320, 2.7214, 4.0578, 2.5115], device='cuda:0') 2023-06-20 15:32:19,088 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2612, simple_loss=0.3595, pruned_loss=0.08138, over 1796401.00 frames. 
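The recurring `Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=...` lines above report the min/25%/median/75%/max of recent gradient norms; in every such entry the logged threshold equals 2.0 times the logged median (e.g. 6.223e+02 = 2.0 x 3.112e+02), which suggests the clipping threshold tracks a multiple of the median gradient norm over a recent window. The sketch below illustrates that mechanism under these assumptions only; it is not the actual optim.py implementation, and the class name `MedianGradClipper` and window size are invented.

```python
import collections
import torch

class MedianGradClipper:
    """Hypothetical sketch: clip gradients against a threshold equal to
    clipping_scale times the median gradient norm over a recent window."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 1024):
        self.clipping_scale = clipping_scale
        self.norms = collections.deque(maxlen=window)  # recent total grad norms
        self.num_clipped = 0
        self.num_seen = 0

    def __call__(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        # total gradient norm across all parameters of the model
        norm = torch.norm(
            torch.stack([p.grad.detach().norm() for p in params])
        ).item()
        self.norms.append(norm)
        self.num_seen += 1
        s = sorted(self.norms)
        threshold = self.clipping_scale * s[len(s) // 2]  # scale * median
        if norm > threshold:  # rescale gradients down to the threshold
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)
        return norm

    def quartiles(self):
        # min / 25% / median / 75% / max, as printed in the log
        s = sorted(self.norms)
        n = len(s) - 1
        return [s[0], s[n // 4], s[n // 2], s[3 * n // 4], s[-1]]

    def percent_clipped(self) -> float:
        return 100.0 * self.num_clipped / max(1, self.num_seen)

# minimal usage on a toy model
model = torch.nn.Linear(4, 4)
model(torch.randn(8, 4)).sum().backward()
clipper = MedianGradClipper()
clipper(model.parameters())
```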
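Similarly, the `Whitening: name=..., metric=X vs. limit=Y` lines track how far each module's channel covariance is from being "white" (proportional to the identity), against a limit that is itself scheduled; the message is printed when the metric exceeds the limit. One plausible metric with the logged behavior (1.0 for perfectly white activations, growing with the spread of the covariance eigenvalues) is sketched below; the formulation and the function name `whitening_metric` are assumptions for illustration, not the actual scaling.py code.

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations of one layer.
    Returns n * sum(eig_i^2) / (sum eig_i)^2 over the channel covariance
    spectrum: 1.0 when all eigenvalues are equal (white), larger otherwise."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]        # channel covariance matrix
    eigs = torch.linalg.eigvalsh(cov)   # real eigenvalues, ascending
    n = eigs.numel()
    return float(n * (eigs ** 2).sum() / (eigs.sum() ** 2 + 1e-20))

x = torch.randn(1000, 256)              # near-white activations
print(whitening_metric(x))              # ~1.0, well under e.g. limit=5.0
```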
2023-06-20 15:32:19,099 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB
2023-06-20 15:32:52,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.636e+02 3.112e+02 3.720e+02 8.461e+02, threshold=6.223e+02, percent-clipped=3.0
2023-06-20 15:33:39,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=585084.0, ans=0.0
2023-06-20 15:33:59,944 INFO [train.py:996] (0/4) Epoch 4, batch 6050, loss[loss=0.2578, simple_loss=0.3085, pruned_loss=0.1036, over 21803.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2985, pruned_loss=0.07721, over 4272867.47 frames. ], batch size: 107, lr: 8.33e-03, grad_scale: 32.0
2023-06-20 15:34:58,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=585324.0, ans=0.125
2023-06-20 15:35:00,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=585324.0, ans=0.2
2023-06-20 15:35:33,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=585384.0, ans=0.125
2023-06-20 15:35:48,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=585444.0, ans=0.1
2023-06-20 15:35:54,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=585444.0, ans=0.125
2023-06-20 15:36:02,797 INFO [train.py:996] (0/4) Epoch 4, batch 6100, loss[loss=0.1832, simple_loss=0.2747, pruned_loss=0.04587, over 21656.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2969, pruned_loss=0.07546, over 4274592.32 frames. ], batch size: 263, lr: 8.33e-03, grad_scale: 32.0
2023-06-20 15:36:43,687 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.490e+02 2.900e+02 3.284e+02 4.880e+02, threshold=5.799e+02, percent-clipped=0.0
2023-06-20 15:37:40,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=585684.0, ans=0.0
2023-06-20 15:37:50,673 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0
2023-06-20 15:38:11,425 INFO [train.py:996] (0/4) Epoch 4, batch 6150, loss[loss=0.215, simple_loss=0.2825, pruned_loss=0.07374, over 21116.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3004, pruned_loss=0.07928, over 4275974.03 frames. ], batch size: 159, lr: 8.33e-03, grad_scale: 32.0
2023-06-20 15:39:28,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=585984.0, ans=0.1
2023-06-20 15:39:34,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=586044.0, ans=0.0
2023-06-20 15:40:10,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=586104.0, ans=0.0
2023-06-20 15:40:11,442 INFO [train.py:996] (0/4) Epoch 4, batch 6200, loss[loss=0.2272, simple_loss=0.2985, pruned_loss=0.07796, over 21290.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3021, pruned_loss=0.0794, over 4273851.81 frames. ], batch size: 143, lr: 8.33e-03, grad_scale: 32.0
2023-06-20 15:40:23,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=586104.0, ans=0.125
2023-06-20 15:40:29,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=586104.0, ans=0.125
2023-06-20 15:40:39,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.393e+02 2.716e+02 3.174e+02 4.783e+02, threshold=5.432e+02, percent-clipped=0.0
2023-06-20 15:40:44,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=586164.0, ans=0.2
2023-06-20 15:42:07,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.77 vs. limit=10.0
2023-06-20 15:42:09,653 INFO [train.py:996] (0/4) Epoch 4, batch 6250, loss[loss=0.231, simple_loss=0.3266, pruned_loss=0.06769, over 21618.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3079, pruned_loss=0.08017, over 4268401.64 frames. ], batch size: 230, lr: 8.32e-03, grad_scale: 32.0
2023-06-20 15:43:56,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=586644.0, ans=0.0
2023-06-20 15:44:16,073 INFO [train.py:996] (0/4) Epoch 4, batch 6300, loss[loss=0.2241, simple_loss=0.3129, pruned_loss=0.06768, over 21659.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3108, pruned_loss=0.07856, over 4272358.01 frames. ], batch size: 230, lr: 8.32e-03, grad_scale: 32.0
2023-06-20 15:44:47,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5
2023-06-20 15:44:48,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.254e+02 2.717e+02 3.393e+02 6.074e+02, threshold=5.434e+02, percent-clipped=2.0
2023-06-20 15:45:06,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=586824.0, ans=0.0
2023-06-20 15:45:31,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=586944.0, ans=0.125
2023-06-20 15:45:53,402 INFO [train.py:996] (0/4) Epoch 4, batch 6350, loss[loss=0.2493, simple_loss=0.3099, pruned_loss=0.09436, over 21432.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3161, pruned_loss=0.08316, over 4280697.46 frames. ], batch size: 211, lr: 8.32e-03, grad_scale: 32.0
2023-06-20 15:46:27,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=587064.0, ans=0.125
2023-06-20 15:46:57,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=587124.0, ans=0.2
2023-06-20 15:47:34,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=587184.0, ans=0.0
2023-06-20 15:48:04,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=587244.0, ans=0.125
2023-06-20 15:48:08,584 INFO [train.py:996] (0/4) Epoch 4, batch 6400, loss[loss=0.3013, simple_loss=0.3607, pruned_loss=0.121, over 21450.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3235, pruned_loss=0.08775, over 4283638.18 frames. ], batch size: 471, lr: 8.32e-03, grad_scale: 32.0
2023-06-20 15:48:10,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587304.0, ans=0.1
2023-06-20 15:48:32,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=587364.0, ans=0.0
2023-06-20 15:48:36,209 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.822e+02 3.385e+02 3.962e+02 5.879e+02, threshold=6.771e+02, percent-clipped=4.0
2023-06-20 15:48:40,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.63 vs. limit=15.0
2023-06-20 15:48:53,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=587424.0, ans=0.125
2023-06-20 15:49:01,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=587424.0, ans=0.09899494936611666
2023-06-20 15:49:19,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=587424.0, ans=0.0
2023-06-20 15:49:40,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=587484.0, ans=0.125
2023-06-20 15:49:52,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=587544.0, ans=0.0
2023-06-20 15:50:16,765 INFO [train.py:996] (0/4) Epoch 4, batch 6450, loss[loss=0.217, simple_loss=0.3061, pruned_loss=0.06392, over 21599.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3252, pruned_loss=0.08731, over 4285436.69 frames. ], batch size: 230, lr: 8.32e-03, grad_scale: 32.0
2023-06-20 15:50:41,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5
2023-06-20 15:51:00,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.71 vs. limit=15.0
2023-06-20 15:51:19,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=587784.0, ans=0.025
2023-06-20 15:51:36,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=587844.0, ans=0.125
2023-06-20 15:51:42,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=12.0
2023-06-20 15:51:52,948 INFO [train.py:996] (0/4) Epoch 4, batch 6500, loss[loss=0.2402, simple_loss=0.2957, pruned_loss=0.09233, over 21474.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3166, pruned_loss=0.08583, over 4288542.45 frames. ], batch size: 441, lr: 8.31e-03, grad_scale: 32.0
2023-06-20 15:52:13,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=587964.0, ans=0.0
2023-06-20 15:52:19,378 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.385e+02 2.722e+02 3.333e+02 5.165e+02, threshold=5.444e+02, percent-clipped=0.0
2023-06-20 15:52:35,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=588024.0, ans=0.125
2023-06-20 15:52:46,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0
2023-06-20 15:53:03,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=588084.0, ans=0.1
2023-06-20 15:53:05,790 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0
2023-06-20 15:53:25,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=588144.0, ans=0.125
2023-06-20 15:53:28,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=588144.0, ans=0.1
2023-06-20 15:53:43,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0
2023-06-20 15:53:49,881 INFO [train.py:996] (0/4) Epoch 4, batch 6550, loss[loss=0.305, simple_loss=0.3567, pruned_loss=0.1266, over 21636.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3145, pruned_loss=0.08481, over 4271685.41 frames. ], batch size: 507, lr: 8.31e-03, grad_scale: 32.0
2023-06-20 15:53:50,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0
2023-06-20 15:53:51,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=588204.0, ans=0.0
2023-06-20 15:54:19,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=588324.0, ans=0.05
2023-06-20 15:54:27,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=588324.0, ans=0.125
2023-06-20 15:54:49,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=588384.0, ans=0.1
2023-06-20 15:54:51,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0
2023-06-20 15:55:06,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=588384.0, ans=0.0
2023-06-20 15:55:25,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=588444.0, ans=0.125
2023-06-20 15:55:39,397 INFO [train.py:996] (0/4) Epoch 4, batch 6600, loss[loss=0.2617, simple_loss=0.3033, pruned_loss=0.11, over 21357.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3089, pruned_loss=0.08416, over 4262857.05 frames. ], batch size: 473, lr: 8.31e-03, grad_scale: 32.0
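The scaling.py:182 lines log a ScheduledFloat: a hyperparameter (dropout probability, skip rate, balancer prob, and so on) whose logged value `ans` is looked up against the global `batch_count`. A sketch of one plausible implementation, piecewise-linear interpolation over breakpoints (the breakpoints below are invented for illustration):

```python
from bisect import bisect_right

def scheduled_float(batch_count: float,
                    schedule: list[tuple[float, float]]) -> float:
    """Piecewise-linear interpolation over (batch_count, value) breakpoints."""
    xs = [x for x, _ in schedule]
    i = bisect_right(xs, batch_count)
    if i == 0:
        return schedule[0][1]
    if i == len(schedule):
        return schedule[-1][1]
    (x0, y0), (x1, y1) = schedule[i - 1], schedule[i]
    return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

# A dropout that decays from 0.3 to 0.1 over the first 20k batches would
# have settled at 0.1 long before the batch counts logged above:
print(scheduled_float(588444.0, [(0.0, 0.3), (20000.0, 0.1)]))  # -> 0.1
```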
2023-06-20 15:55:42,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=588504.0, ans=0.125
2023-06-20 15:55:48,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=588504.0, ans=0.2
2023-06-20 15:56:09,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.413e+02 2.653e+02 3.142e+02 5.278e+02, threshold=5.306e+02, percent-clipped=0.0
2023-06-20 15:56:19,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=588564.0, ans=0.0
2023-06-20 15:56:30,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=15.0
2023-06-20 15:57:31,176 INFO [train.py:996] (0/4) Epoch 4, batch 6650, loss[loss=0.2373, simple_loss=0.2972, pruned_loss=0.08869, over 21603.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3028, pruned_loss=0.08018, over 4265341.06 frames. ], batch size: 415, lr: 8.31e-03, grad_scale: 32.0
2023-06-20 15:57:48,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=588804.0, ans=0.125
2023-06-20 15:58:22,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5
2023-06-20 15:59:05,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=589044.0, ans=0.0
2023-06-20 15:59:13,814 INFO [train.py:996] (0/4) Epoch 4, batch 6700, loss[loss=0.2346, simple_loss=0.2937, pruned_loss=0.08779, over 21806.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2981, pruned_loss=0.07958, over 4261100.19 frames. ], batch size: 352, lr: 8.31e-03, grad_scale: 32.0
2023-06-20 15:59:54,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.379e+02 2.739e+02 3.210e+02 5.153e+02, threshold=5.478e+02, percent-clipped=0.0
2023-06-20 16:00:08,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0
2023-06-20 16:00:15,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=589224.0, ans=0.125
2023-06-20 16:00:29,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=589284.0, ans=0.1
2023-06-20 16:00:32,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=589284.0, ans=0.125
2023-06-20 16:00:44,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=589344.0, ans=0.07
2023-06-20 16:01:03,413 INFO [train.py:996] (0/4) Epoch 4, batch 6750, loss[loss=0.226, simple_loss=0.2932, pruned_loss=0.0794, over 21777.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.296, pruned_loss=0.07931, over 4256393.28 frames. ], batch size: 102, lr: 8.30e-03, grad_scale: 32.0
2023-06-20 16:01:15,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=589404.0, ans=0.125
2023-06-20 16:01:59,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=589584.0, ans=0.0
2023-06-20 16:02:08,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0
2023-06-20 16:02:18,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=589644.0, ans=0.0
2023-06-20 16:02:39,945 INFO [train.py:996] (0/4) Epoch 4, batch 6800, loss[loss=0.2095, simple_loss=0.3226, pruned_loss=0.04825, over 19753.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2982, pruned_loss=0.08213, over 4271678.93 frames. ], batch size: 702, lr: 8.30e-03, grad_scale: 32.0
2023-06-20 16:03:02,165 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.637e+02 2.976e+02 3.448e+02 7.542e+02, threshold=5.952e+02, percent-clipped=4.0
2023-06-20 16:03:23,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=589824.0, ans=0.0
2023-06-20 16:03:31,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0
2023-06-20 16:04:16,751 INFO [train.py:996] (0/4) Epoch 4, batch 6850, loss[loss=0.2215, simple_loss=0.2783, pruned_loss=0.08234, over 21608.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2969, pruned_loss=0.08351, over 4266663.64 frames. ], batch size: 263, lr: 8.30e-03, grad_scale: 32.0
2023-06-20 16:04:18,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=590004.0, ans=0.025
2023-06-20 16:04:26,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=590004.0, ans=0.125
2023-06-20 16:04:33,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=590064.0, ans=0.125
2023-06-20 16:04:33,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=590064.0, ans=0.125
2023-06-20 16:04:35,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.79 vs. limit=10.0
2023-06-20 16:05:55,773 INFO [train.py:996] (0/4) Epoch 4, batch 6900, loss[loss=0.2376, simple_loss=0.3312, pruned_loss=0.07202, over 21566.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3003, pruned_loss=0.08419, over 4272852.05 frames. ], batch size: 471, lr: 8.30e-03, grad_scale: 32.0
2023-06-20 16:06:14,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=590304.0, ans=0.1
2023-06-20 16:06:31,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=590364.0, ans=12.0
2023-06-20 16:06:47,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 2.466e+02 2.910e+02 3.426e+02 7.067e+02, threshold=5.820e+02, percent-clipped=2.0
2023-06-20 16:07:50,464 INFO [train.py:996] (0/4) Epoch 4, batch 6950, loss[loss=0.2504, simple_loss=0.3221, pruned_loss=0.08937, over 21293.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3007, pruned_loss=0.0808, over 4277791.04 frames. ], batch size: 548, lr: 8.29e-03, grad_scale: 32.0
2023-06-20 16:08:23,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=590664.0, ans=0.125
2023-06-20 16:09:31,674 INFO [train.py:996] (0/4) Epoch 4, batch 7000, loss[loss=0.2316, simple_loss=0.2904, pruned_loss=0.0864, over 21614.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3033, pruned_loss=0.08296, over 4276743.18 frames. ], batch size: 298, lr: 8.29e-03, grad_scale: 32.0
2023-06-20 16:10:04,333 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.556e+02 2.969e+02 3.627e+02 5.556e+02, threshold=5.939e+02, percent-clipped=0.0
2023-06-20 16:10:11,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=591024.0, ans=0.125
2023-06-20 16:10:59,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=591084.0, ans=0.1
2023-06-20 16:11:09,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=591144.0, ans=0.1
2023-06-20 16:11:31,286 INFO [train.py:996] (0/4) Epoch 4, batch 7050, loss[loss=0.2063, simple_loss=0.2957, pruned_loss=0.05848, over 21177.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3005, pruned_loss=0.08195, over 4258301.00 frames. ], batch size: 548, lr: 8.29e-03, grad_scale: 32.0
2023-06-20 16:11:42,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0
2023-06-20 16:11:43,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=591204.0, ans=0.125
2023-06-20 16:11:58,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5
2023-06-20 16:12:03,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=591264.0, ans=0.125
2023-06-20 16:12:08,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=591264.0, ans=0.07
2023-06-20 16:13:35,494 INFO [train.py:996] (0/4) Epoch 4, batch 7100, loss[loss=0.2402, simple_loss=0.3185, pruned_loss=0.08091, over 21746.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3065, pruned_loss=0.08306, over 4261355.30 frames. ], batch size: 332, lr: 8.29e-03, grad_scale: 32.0
2023-06-20 16:13:57,793 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 2.233e+02 2.662e+02 3.259e+02 5.350e+02, threshold=5.324e+02, percent-clipped=0.0
2023-06-20 16:14:00,295 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.43 vs. limit=15.0
2023-06-20 16:14:36,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=591684.0, ans=0.1
2023-06-20 16:15:18,612 INFO [train.py:996] (0/4) Epoch 4, batch 7150, loss[loss=0.279, simple_loss=0.343, pruned_loss=0.1075, over 21907.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3045, pruned_loss=0.08123, over 4268355.56 frames. ], batch size: 372, lr: 8.29e-03, grad_scale: 32.0
2023-06-20 16:15:34,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0
2023-06-20 16:15:54,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=591924.0, ans=0.0
2023-06-20 16:16:09,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=591924.0, ans=0.1
2023-06-20 16:16:11,903 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 16:16:13,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0
2023-06-20 16:16:22,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=591984.0, ans=0.125
2023-06-20 16:16:59,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=592044.0, ans=0.1
2023-06-20 16:17:04,854 INFO [train.py:996] (0/4) Epoch 4, batch 7200, loss[loss=0.2231, simple_loss=0.2811, pruned_loss=0.08253, over 21573.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3076, pruned_loss=0.08406, over 4271024.78 frames. ], batch size: 247, lr: 8.28e-03, grad_scale: 32.0
2023-06-20 16:17:27,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 2.565e+02 2.895e+02 3.560e+02 7.126e+02, threshold=5.790e+02, percent-clipped=7.0
2023-06-20 16:17:45,323 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0
2023-06-20 16:17:56,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=592224.0, ans=0.0
2023-06-20 16:18:03,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0
2023-06-20 16:18:41,923 INFO [train.py:996] (0/4) Epoch 4, batch 7250, loss[loss=0.2306, simple_loss=0.2828, pruned_loss=0.08917, over 21368.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3037, pruned_loss=0.08338, over 4273682.30 frames. ], batch size: 177, lr: 8.28e-03, grad_scale: 32.0
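In every optim.py:471 record the threshold equals Clipping_scale times the middle of the five reported grad-norm quantiles (for instance 2.0 * 2.662e+02 = 5.324e+02 in the record above), and percent-clipped is the share of recent batches whose gradient norm exceeded it. A sketch of that bookkeeping; the window size and exact quantile choice are assumptions, not read from optim.py:

```python
import numpy as np

class GradNormClipper:
    """Track recent grad norms; clip at clipping_scale * median."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 128) -> None:
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms: list[float] = []

    def record(self, grad_norm: float) -> None:
        self.norms.append(grad_norm)
        self.norms = self.norms[-self.window:]

    def threshold(self) -> float:
        return self.clipping_scale * float(np.median(self.norms))

    def summary(self) -> str:
        q = np.quantile(self.norms, [0.0, 0.25, 0.5, 0.75, 1.0])
        clipped = 100.0 * float(np.mean(np.array(self.norms) > self.threshold()))
        quartiles = " ".join(f"{v:.3e}" for v in q)
        return (f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
                f"{quartiles}, threshold={self.threshold():.3e}, "
                f"percent-clipped={clipped:.1f}")
```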
2023-06-20 16:19:34,286 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 16:20:19,815 INFO [train.py:996] (0/4) Epoch 4, batch 7300, loss[loss=0.2087, simple_loss=0.2628, pruned_loss=0.07727, over 21197.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.2977, pruned_loss=0.08206, over 4264641.52 frames. ], batch size: 176, lr: 8.28e-03, grad_scale: 32.0
2023-06-20 16:20:29,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=592704.0, ans=0.2
2023-06-20 16:20:51,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=592764.0, ans=0.1
2023-06-20 16:20:53,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.476e+02 2.833e+02 3.500e+02 5.020e+02, threshold=5.666e+02, percent-clipped=0.0
2023-06-20 16:20:59,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=592764.0, ans=0.125
2023-06-20 16:21:00,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=592824.0, ans=0.02
2023-06-20 16:21:52,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0
2023-06-20 16:22:03,831 INFO [train.py:996] (0/4) Epoch 4, batch 7350, loss[loss=0.3195, simple_loss=0.3611, pruned_loss=0.1389, over 21444.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.2957, pruned_loss=0.08239, over 4272933.04 frames. ], batch size: 510, lr: 8.28e-03, grad_scale: 32.0
2023-06-20 16:22:46,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=593064.0, ans=0.125
2023-06-20 16:23:17,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=593184.0, ans=0.125
2023-06-20 16:23:21,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0
2023-06-20 16:23:56,341 INFO [train.py:996] (0/4) Epoch 4, batch 7400, loss[loss=0.214, simple_loss=0.2877, pruned_loss=0.07015, over 21472.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3036, pruned_loss=0.08599, over 4275764.39 frames. ], batch size: 212, lr: 8.28e-03, grad_scale: 32.0
2023-06-20 16:24:23,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.27 vs. limit=15.0
2023-06-20 16:24:24,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.858e+02 3.325e+02 3.971e+02 5.288e+02, threshold=6.650e+02, percent-clipped=0.0
2023-06-20 16:25:02,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=593484.0, ans=0.125
2023-06-20 16:25:33,490 INFO [train.py:996] (0/4) Epoch 4, batch 7450, loss[loss=0.2076, simple_loss=0.269, pruned_loss=0.07309, over 21583.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3032, pruned_loss=0.08532, over 4280178.16 frames. ], batch size: 230, lr: 8.27e-03, grad_scale: 32.0
2023-06-20 16:26:37,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=593784.0, ans=0.1
2023-06-20 16:27:34,237 INFO [train.py:996] (0/4) Epoch 4, batch 7500, loss[loss=0.2585, simple_loss=0.3628, pruned_loss=0.07708, over 21870.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3073, pruned_loss=0.08697, over 4268674.65 frames. ], batch size: 317, lr: 8.27e-03, grad_scale: 32.0
2023-06-20 16:27:36,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=593904.0, ans=0.0
2023-06-20 16:27:39,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0
2023-06-20 16:28:09,126 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.540e+02 2.828e+02 3.392e+02 7.342e+02, threshold=5.655e+02, percent-clipped=3.0
2023-06-20 16:28:15,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=594024.0, ans=0.0
2023-06-20 16:29:13,483 INFO [train.py:996] (0/4) Epoch 4, batch 7550, loss[loss=0.2422, simple_loss=0.3376, pruned_loss=0.07343, over 21722.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3163, pruned_loss=0.08637, over 4265117.02 frames. ], batch size: 298, lr: 8.27e-03, grad_scale: 32.0
2023-06-20 16:29:58,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0
2023-06-20 16:30:24,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=594384.0, ans=0.2
2023-06-20 16:30:49,999 INFO [train.py:996] (0/4) Epoch 4, batch 7600, loss[loss=0.2805, simple_loss=0.3489, pruned_loss=0.1061, over 21861.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.314, pruned_loss=0.08462, over 4265413.32 frames. ], batch size: 107, lr: 8.27e-03, grad_scale: 32.0
2023-06-20 16:31:02,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=594504.0, ans=0.0
2023-06-20 16:31:17,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.434e+02 2.853e+02 3.264e+02 4.790e+02, threshold=5.707e+02, percent-clipped=0.0
2023-06-20 16:31:29,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=594624.0, ans=0.0
2023-06-20 16:31:32,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=594624.0, ans=0.1
2023-06-20 16:31:56,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=594684.0, ans=0.125
2023-06-20 16:32:25,976 INFO [train.py:996] (0/4) Epoch 4, batch 7650, loss[loss=0.2398, simple_loss=0.3044, pruned_loss=0.08755, over 21939.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.312, pruned_loss=0.08608, over 4274318.19 frames. ], batch size: 316, lr: 8.27e-03, grad_scale: 32.0
2023-06-20 16:32:35,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=594804.0, ans=0.125
2023-06-20 16:32:46,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=594864.0, ans=0.125
2023-06-20 16:33:07,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=594924.0, ans=0.125
2023-06-20 16:33:13,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=594924.0, ans=0.0
2023-06-20 16:33:23,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=594984.0, ans=0.1
2023-06-20 16:34:04,135 INFO [train.py:996] (0/4) Epoch 4, batch 7700, loss[loss=0.2744, simple_loss=0.355, pruned_loss=0.09688, over 16893.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3155, pruned_loss=0.0884, over 4279405.75 frames. ], batch size: 60, lr: 8.26e-03, grad_scale: 32.0
2023-06-20 16:34:13,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=595104.0, ans=0.1
2023-06-20 16:34:33,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=595164.0, ans=0.125
2023-06-20 16:34:37,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=595164.0, ans=0.125
2023-06-20 16:34:38,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0
2023-06-20 16:34:40,151 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.502e+02 2.863e+02 3.403e+02 5.125e+02, threshold=5.726e+02, percent-clipped=0.0
2023-06-20 16:34:45,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=595224.0, ans=0.1
2023-06-20 16:36:09,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=595404.0, ans=0.0
2023-06-20 16:36:10,694 INFO [train.py:996] (0/4) Epoch 4, batch 7750, loss[loss=0.2167, simple_loss=0.3038, pruned_loss=0.06478, over 21433.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3203, pruned_loss=0.08876, over 4267880.72 frames. ], batch size: 131, lr: 8.26e-03, grad_scale: 32.0
2023-06-20 16:37:15,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=595584.0, ans=0.2
2023-06-20 16:37:30,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=595584.0, ans=0.1
2023-06-20 16:38:19,751 INFO [train.py:996] (0/4) Epoch 4, batch 7800, loss[loss=0.2091, simple_loss=0.2563, pruned_loss=0.08094, over 21172.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3217, pruned_loss=0.08849, over 4267588.21 frames. ], batch size: 159, lr: 8.26e-03, grad_scale: 32.0
2023-06-20 16:38:37,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=595764.0, ans=0.125
2023-06-20 16:38:49,532 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.887e+02 3.418e+02 4.293e+02 7.867e+02, threshold=6.836e+02, percent-clipped=4.0
2023-06-20 16:39:20,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=595884.0, ans=0.125
2023-06-20 16:39:34,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=595944.0, ans=0.0
2023-06-20 16:39:58,237 INFO [train.py:996] (0/4) Epoch 4, batch 7850, loss[loss=0.2306, simple_loss=0.2899, pruned_loss=0.08564, over 21906.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.315, pruned_loss=0.0874, over 4252060.36 frames. ], batch size: 373, lr: 8.26e-03, grad_scale: 16.0
2023-06-20 16:41:02,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=596184.0, ans=0.1
2023-06-20 16:41:03,117 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0
2023-06-20 16:41:04,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=596184.0, ans=10.0
2023-06-20 16:41:48,967 INFO [train.py:996] (0/4) Epoch 4, batch 7900, loss[loss=0.2312, simple_loss=0.3051, pruned_loss=0.07871, over 21403.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3103, pruned_loss=0.08593, over 4248759.30 frames. ], batch size: 211, lr: 8.26e-03, grad_scale: 16.0
2023-06-20 16:42:04,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=596304.0, ans=0.0
2023-06-20 16:42:35,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.748e+02 3.368e+02 4.194e+02 8.125e+02, threshold=6.737e+02, percent-clipped=4.0
2023-06-20 16:42:54,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=596424.0, ans=0.0
2023-06-20 16:43:45,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=596544.0, ans=0.125
2023-06-20 16:43:52,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=596544.0, ans=0.0
2023-06-20 16:43:56,411 INFO [train.py:996] (0/4) Epoch 4, batch 7950, loss[loss=0.2444, simple_loss=0.3233, pruned_loss=0.08279, over 21932.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.311, pruned_loss=0.08498, over 4248940.56 frames. ], batch size: 316, lr: 8.25e-03, grad_scale: 16.0
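The scaling.py:962 lines fire when a per-module whitening statistic exceeds its limit (e.g. metric=10.27 vs. limit=15.0 above). One plausible reading of the metric, an assumption rather than something read from scaling.py, is a measure of how uneven the eigenvalues of the channel covariance are; it is 1.0 for perfectly "white" features and grows as channels become correlated:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (frames, channels). mean(eig^2) / mean(eig)^2 of the channel
    covariance, averaged over groups; equals 1.0 for a white signal."""
    metrics = []
    for g in x.chunk(num_groups, dim=1):
        g = g - g.mean(dim=0)
        cov = (g.T @ g) / g.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        metrics.append((eigs ** 2).mean() / eigs.mean() ** 2)
    return float(torch.stack(metrics).mean())

x = torch.randn(1000, 256)   # near-white features
print(whitening_metric(x))   # near 1.0, far below a limit like 15.0
```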
2023-06-20 16:44:22,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=596664.0, ans=0.0
2023-06-20 16:44:44,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=596724.0, ans=0.1
2023-06-20 16:45:29,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=596784.0, ans=0.125
2023-06-20 16:45:29,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=596784.0, ans=0.125
2023-06-20 16:45:56,308 INFO [train.py:996] (0/4) Epoch 4, batch 8000, loss[loss=0.2664, simple_loss=0.3323, pruned_loss=0.1002, over 21326.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3171, pruned_loss=0.08749, over 4262098.33 frames. ], batch size: 548, lr: 8.25e-03, grad_scale: 32.0
2023-06-20 16:46:27,803 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.568e+02 2.923e+02 3.494e+02 7.833e+02, threshold=5.846e+02, percent-clipped=1.0
2023-06-20 16:46:41,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=597024.0, ans=0.125
2023-06-20 16:48:01,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0
2023-06-20 16:48:21,272 INFO [train.py:996] (0/4) Epoch 4, batch 8050, loss[loss=0.2669, simple_loss=0.3486, pruned_loss=0.09258, over 21744.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3218, pruned_loss=0.08886, over 4269389.62 frames. ], batch size: 332, lr: 8.25e-03, grad_scale: 32.0
2023-06-20 16:48:29,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=597204.0, ans=0.0
2023-06-20 16:48:32,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=597204.0, ans=0.125
2023-06-20 16:48:34,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=12.0
2023-06-20 16:48:41,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=597264.0, ans=0.125
2023-06-20 16:49:10,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=597324.0, ans=0.1
2023-06-20 16:49:19,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=597384.0, ans=0.5
2023-06-20 16:49:42,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=597444.0, ans=0.125
2023-06-20 16:49:42,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0
2023-06-20 16:50:00,268 INFO [train.py:996] (0/4) Epoch 4, batch 8100, loss[loss=0.2245, simple_loss=0.2954, pruned_loss=0.07678, over 21681.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3209, pruned_loss=0.08982, over 4273519.30 frames. ], batch size: 263, lr: 8.25e-03, grad_scale: 32.0
2023-06-20 16:50:20,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=597564.0, ans=0.125
2023-06-20 16:50:38,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 2.976e+02 3.712e+02 5.223e+02 1.010e+03, threshold=7.424e+02, percent-clipped=11.0
2023-06-20 16:50:40,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=597564.0, ans=0.0
2023-06-20 16:50:48,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=597624.0, ans=0.125
2023-06-20 16:52:09,556 INFO [train.py:996] (0/4) Epoch 4, batch 8150, loss[loss=0.278, simple_loss=0.3724, pruned_loss=0.09182, over 21694.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3287, pruned_loss=0.09091, over 4269465.64 frames. ], batch size: 414, lr: 8.24e-03, grad_scale: 32.0
2023-06-20 16:52:12,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0
2023-06-20 16:52:13,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=597804.0, ans=0.0
2023-06-20 16:53:16,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=597984.0, ans=0.125
2023-06-20 16:53:24,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=597984.0, ans=15.0
2023-06-20 16:53:50,591 INFO [train.py:996] (0/4) Epoch 4, batch 8200, loss[loss=0.2521, simple_loss=0.298, pruned_loss=0.1031, over 21351.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3216, pruned_loss=0.08842, over 4257628.52 frames. ], batch size: 473, lr: 8.24e-03, grad_scale: 32.0
2023-06-20 16:54:01,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=598104.0, ans=0.125
2023-06-20 16:54:15,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=598104.0, ans=0.2
2023-06-20 16:54:18,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=598104.0, ans=0.125
2023-06-20 16:54:24,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=598164.0, ans=0.1
2023-06-20 16:54:36,858 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.510e+02 2.906e+02 3.709e+02 6.115e+02, threshold=5.811e+02, percent-clipped=0.0
2023-06-20 16:54:38,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=598164.0, ans=0.0
2023-06-20 16:55:54,977 INFO [train.py:996] (0/4) Epoch 4, batch 8250, loss[loss=0.2148, simple_loss=0.3037, pruned_loss=0.06297, over 21693.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3227, pruned_loss=0.08928, over 4263930.19 frames. ], batch size: 247, lr: 8.24e-03, grad_scale: 32.0
2023-06-20 16:56:07,581 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0
2023-06-20 16:57:03,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=598584.0, ans=0.125
2023-06-20 16:57:18,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=598644.0, ans=0.125
2023-06-20 16:57:38,849 INFO [train.py:996] (0/4) Epoch 4, batch 8300, loss[loss=0.2071, simple_loss=0.2799, pruned_loss=0.06718, over 21240.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3193, pruned_loss=0.08637, over 4260536.84 frames. ], batch size: 159, lr: 8.24e-03, grad_scale: 32.0
2023-06-20 16:57:45,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=598704.0, ans=0.1
2023-06-20 16:58:09,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.374e+02 2.802e+02 3.318e+02 5.012e+02, threshold=5.604e+02, percent-clipped=0.0
2023-06-20 16:58:19,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=598824.0, ans=0.2
2023-06-20 16:58:44,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=598884.0, ans=0.0
2023-06-20 16:59:01,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=598944.0, ans=0.0
2023-06-20 16:59:15,924 INFO [train.py:996] (0/4) Epoch 4, batch 8350, loss[loss=0.2086, simple_loss=0.2834, pruned_loss=0.06694, over 21218.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3154, pruned_loss=0.0836, over 4250780.68 frames. ], batch size: 176, lr: 8.24e-03, grad_scale: 32.0
2023-06-20 16:59:27,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=599004.0, ans=0.125
2023-06-20 16:59:40,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=599064.0, ans=0.2
2023-06-20 16:59:40,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=599064.0, ans=15.0
2023-06-20 16:59:42,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=599064.0, ans=0.125
2023-06-20 16:59:52,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=599124.0, ans=0.125
2023-06-20 17:00:28,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=599184.0, ans=0.0
2023-06-20 17:00:43,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0
2023-06-20 17:00:54,167 INFO [train.py:996] (0/4) Epoch 4, batch 8400, loss[loss=0.1846, simple_loss=0.2449, pruned_loss=0.06212, over 21907.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3134, pruned_loss=0.08168, over 4259718.12 frames. ], batch size: 98, lr: 8.23e-03, grad_scale: 32.0
2023-06-20 17:00:56,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0
2023-06-20 17:00:57,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=599304.0, ans=0.0
2023-06-20 17:01:13,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=599364.0, ans=0.1
2023-06-20 17:01:14,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=599364.0, ans=0.125
2023-06-20 17:01:24,804 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.360e+02 2.679e+02 3.034e+02 4.807e+02, threshold=5.358e+02, percent-clipped=0.0
2023-06-20 17:01:25,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.04 vs. limit=12.0
2023-06-20 17:01:29,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=599424.0, ans=0.125
2023-06-20 17:02:38,971 INFO [train.py:996] (0/4) Epoch 4, batch 8450, loss[loss=0.2324, simple_loss=0.2918, pruned_loss=0.08648, over 21679.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3127, pruned_loss=0.08211, over 4268303.58 frames. ], batch size: 263, lr: 8.23e-03, grad_scale: 32.0
2023-06-20 17:03:10,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=599664.0, ans=0.125
2023-06-20 17:03:38,388 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 17:03:48,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=599724.0, ans=0.125
2023-06-20 17:04:34,201 INFO [train.py:996] (0/4) Epoch 4, batch 8500, loss[loss=0.2612, simple_loss=0.3178, pruned_loss=0.1024, over 21756.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3085, pruned_loss=0.08295, over 4274781.15 frames. ], batch size: 316, lr: 8.23e-03, grad_scale: 32.0
2023-06-20 17:04:36,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=599904.0, ans=0.125
2023-06-20 17:05:03,606 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-100000.pt
2023-06-20 17:05:07,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=599964.0, ans=10.0
2023-06-20 17:05:09,683 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 2.562e+02 2.810e+02 3.301e+02 4.951e+02, threshold=5.621e+02, percent-clipped=0.0
2023-06-20 17:05:16,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=600024.0, ans=0.02
2023-06-20 17:06:14,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=600144.0, ans=0.0
2023-06-20 17:06:17,609 INFO [train.py:996] (0/4) Epoch 4, batch 8550, loss[loss=0.2087, simple_loss=0.2814, pruned_loss=0.06799, over 21337.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3135, pruned_loss=0.08535, over 4271269.33 frames. ], batch size: 144, lr: 8.23e-03, grad_scale: 32.0
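checkpoint.py:75 above writes a batch-numbered file, zipformer/exp_L_small/checkpoint-100000.pt, which suggests checkpoints are dropped every fixed number of training batches. A minimal sketch of that pattern (the helper name and the saved fields are illustrative, not icefall's exact checkpoint schema):

```python
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx: int,
                          exp_dir: str, every_n: int) -> None:
    # Save only on multiples of every_n, producing names like the
    # checkpoint-100000.pt in the log line above.
    if batch_idx == 0 or batch_idx % every_n != 0:
        return
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx,
        },
        f"{exp_dir}/checkpoint-{batch_idx}.pt",
    )
```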
2023-06-20 17:06:19,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=600204.0, ans=0.125
2023-06-20 17:07:16,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=600384.0, ans=0.0
2023-06-20 17:07:28,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=600384.0, ans=0.1
2023-06-20 17:07:36,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=600384.0, ans=0.0
2023-06-20 17:07:38,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=600384.0, ans=0.0
2023-06-20 17:07:40,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=600444.0, ans=0.0
2023-06-20 17:07:41,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=600444.0, ans=0.0
2023-06-20 17:08:03,049 INFO [train.py:996] (0/4) Epoch 4, batch 8600, loss[loss=0.2971, simple_loss=0.3971, pruned_loss=0.09857, over 19826.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3209, pruned_loss=0.08775, over 4263288.47 frames. ], batch size: 702, lr: 8.23e-03, grad_scale: 32.0
2023-06-20 17:08:41,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=600564.0, ans=0.2
2023-06-20 17:08:49,978 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.757e+02 3.185e+02 4.135e+02 6.803e+02, threshold=6.371e+02, percent-clipped=9.0
2023-06-20 17:08:54,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=600624.0, ans=0.125
2023-06-20 17:08:54,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=600624.0, ans=0.0
2023-06-20 17:09:07,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=600624.0, ans=0.125
2023-06-20 17:09:34,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=600684.0, ans=0.1
2023-06-20 17:10:02,025 INFO [train.py:996] (0/4) Epoch 4, batch 8650, loss[loss=0.193, simple_loss=0.294, pruned_loss=0.04596, over 21783.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3271, pruned_loss=0.08824, over 4270843.36 frames. ], batch size: 332, lr: 8.22e-03, grad_scale: 32.0
2023-06-20 17:10:10,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=600804.0, ans=0.0
2023-06-20 17:10:11,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=600804.0, ans=0.125
2023-06-20 17:10:37,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=600924.0, ans=0.035
2023-06-20 17:10:43,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=600924.0, ans=0.05
2023-06-20 17:11:16,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0
2023-06-20 17:11:30,599 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5
2023-06-20 17:11:31,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=601044.0, ans=0.125
2023-06-20 17:11:38,694 INFO [train.py:996] (0/4) Epoch 4, batch 8700, loss[loss=0.2104, simple_loss=0.2747, pruned_loss=0.07304, over 21846.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3187, pruned_loss=0.0833, over 4262975.42 frames. ], batch size: 107, lr: 8.22e-03, grad_scale: 32.0
2023-06-20 17:12:09,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 2.376e+02 2.840e+02 3.647e+02 6.545e+02, threshold=5.680e+02, percent-clipped=1.0
2023-06-20 17:12:09,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=601164.0, ans=0.125
2023-06-20 17:12:12,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=601224.0, ans=0.2
2023-06-20 17:12:25,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=601224.0, ans=0.125
2023-06-20 17:13:11,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=601344.0, ans=0.125
2023-06-20 17:13:24,747 INFO [train.py:996] (0/4) Epoch 4, batch 8750, loss[loss=0.2426, simple_loss=0.3107, pruned_loss=0.08727, over 21814.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3151, pruned_loss=0.08448, over 4260460.60 frames. ], batch size: 298, lr: 8.22e-03, grad_scale: 32.0
2023-06-20 17:13:26,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. limit=10.0
2023-06-20 17:14:44,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=601584.0, ans=0.125
2023-06-20 17:14:58,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=601644.0, ans=0.125
2023-06-20 17:14:59,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=601644.0, ans=0.2
2023-06-20 17:15:02,031 INFO [train.py:996] (0/4) Epoch 4, batch 8800, loss[loss=0.303, simple_loss=0.3784, pruned_loss=0.1138, over 21590.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3228, pruned_loss=0.08767, over 4266700.45 frames. ], batch size: 389, lr: 8.22e-03, grad_scale: 32.0
2023-06-20 17:15:42,502 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.769e+02 3.106e+02 3.590e+02 5.947e+02, threshold=6.211e+02, percent-clipped=1.0
2023-06-20 17:16:37,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. limit=10.0
2023-06-20 17:16:41,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=601884.0, ans=0.125
2023-06-20 17:16:59,676 INFO [train.py:996] (0/4) Epoch 4, batch 8850, loss[loss=0.2389, simple_loss=0.3289, pruned_loss=0.07444, over 15965.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3292, pruned_loss=0.08944, over 4261443.17 frames. ], batch size: 61, lr: 8.22e-03, grad_scale: 32.0
2023-06-20 17:18:35,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=602244.0, ans=0.1
2023-06-20 17:18:40,454 INFO [train.py:996] (0/4) Epoch 4, batch 8900, loss[loss=0.2058, simple_loss=0.274, pruned_loss=0.06885, over 21322.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3233, pruned_loss=0.08838, over 4259453.99 frames. ], batch size: 194, lr: 8.21e-03, grad_scale: 32.0
2023-06-20 17:19:30,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.687e+02 3.012e+02 3.727e+02 9.034e+02, threshold=6.025e+02, percent-clipped=1.0
2023-06-20 17:19:55,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=15.0
2023-06-20 17:20:18,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=602484.0, ans=0.2
2023-06-20 17:20:55,165 INFO [train.py:996] (0/4) Epoch 4, batch 8950, loss[loss=0.2034, simple_loss=0.2589, pruned_loss=0.07388, over 21193.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3236, pruned_loss=0.08788, over 4255754.17 frames. ], batch size: 159, lr: 8.21e-03, grad_scale: 32.0
2023-06-20 17:21:16,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=602604.0, ans=0.0
2023-06-20 17:21:51,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=602724.0, ans=0.125
2023-06-20 17:22:37,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=602904.0, ans=0.2
2023-06-20 17:22:38,302 INFO [train.py:996] (0/4) Epoch 4, batch 9000, loss[loss=0.1943, simple_loss=0.2503, pruned_loss=0.06917, over 21428.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3181, pruned_loss=0.08716, over 4261016.45 frames. ], batch size: 212, lr: 8.21e-03, grad_scale: 32.0
2023-06-20 17:22:38,303 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-20 17:23:23,915 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.2895, 1.9698, 3.7737, 3.6731], device='cuda:0')
2023-06-20 17:23:29,200 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2733, simple_loss=0.3656, pruned_loss=0.09047, over 1796401.00 frames.
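Just before each validation result, zipformer.py:1728 logs one entropy value per attention head (four here, matching the tensor([...]) of length 4). A sketch of that diagnostic under assumed shape conventions: the entropy, in nats, of each head's attention distribution, averaged over query positions:

```python
import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    """attn: (num_heads, query_len, key_len), each row summing to 1."""
    p = attn.clamp(min=1e-20)
    ent = -(p * p.log()).sum(dim=-1)  # entropy per (head, query)
    return ent.mean(dim=-1)           # one value per head

weights = torch.softmax(torch.randn(4, 10, 10), dim=-1)
print(attn_weights_entropy(weights))  # 4 entropies, like the tensor above
```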
2023-06-20 17:23:29,201 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-20 17:23:48,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=602964.0, ans=0.125 2023-06-20 17:23:54,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=602964.0, ans=0.125 2023-06-20 17:23:55,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.920e+02 3.392e+02 4.044e+02 7.869e+02, threshold=6.783e+02, percent-clipped=2.0 2023-06-20 17:24:48,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-20 17:25:06,920 INFO [train.py:996] (0/4) Epoch 4, batch 9050, loss[loss=0.2388, simple_loss=0.3159, pruned_loss=0.08091, over 21554.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3146, pruned_loss=0.08364, over 4252590.06 frames. ], batch size: 389, lr: 8.21e-03, grad_scale: 16.0 2023-06-20 17:25:07,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=603204.0, ans=0.125 2023-06-20 17:26:17,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.27 vs. limit=6.0 2023-06-20 17:27:03,502 INFO [train.py:996] (0/4) Epoch 4, batch 9100, loss[loss=0.2549, simple_loss=0.3334, pruned_loss=0.08826, over 21449.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3198, pruned_loss=0.08539, over 4256278.95 frames. ], batch size: 131, lr: 8.21e-03, grad_scale: 16.0 2023-06-20 17:27:21,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=603504.0, ans=0.0 2023-06-20 17:27:22,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=603504.0, ans=0.07 2023-06-20 17:27:57,736 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.581e+02 3.087e+02 3.555e+02 5.100e+02, threshold=6.174e+02, percent-clipped=0.0 2023-06-20 17:28:14,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=603624.0, ans=0.1 2023-06-20 17:29:07,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=603804.0, ans=0.04949747468305833 2023-06-20 17:29:08,293 INFO [train.py:996] (0/4) Epoch 4, batch 9150, loss[loss=0.246, simple_loss=0.3368, pruned_loss=0.07757, over 21783.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.319, pruned_loss=0.08388, over 4256000.68 frames. ], batch size: 332, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 17:29:32,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.16 vs. 
limit=22.5 2023-06-20 17:29:39,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=603864.0, ans=0.0 2023-06-20 17:30:30,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=603984.0, ans=0.04949747468305833 2023-06-20 17:30:45,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-20 17:30:46,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=604044.0, ans=0.125 2023-06-20 17:31:03,051 INFO [train.py:996] (0/4) Epoch 4, batch 9200, loss[loss=0.2787, simple_loss=0.3429, pruned_loss=0.1072, over 21397.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3215, pruned_loss=0.08281, over 4263466.60 frames. ], batch size: 131, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:31:47,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.40 vs. limit=10.0 2023-06-20 17:31:47,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.497e+02 2.858e+02 3.581e+02 5.930e+02, threshold=5.716e+02, percent-clipped=0.0 2023-06-20 17:32:27,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=604284.0, ans=0.125 2023-06-20 17:32:54,610 INFO [train.py:996] (0/4) Epoch 4, batch 9250, loss[loss=0.3277, simple_loss=0.4391, pruned_loss=0.1081, over 19761.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3255, pruned_loss=0.08629, over 4267602.06 frames. ], batch size: 702, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:33:08,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=604404.0, ans=0.125 2023-06-20 17:33:23,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=604464.0, ans=0.0 2023-06-20 17:34:36,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=604644.0, ans=0.09899494936611666 2023-06-20 17:34:38,707 INFO [train.py:996] (0/4) Epoch 4, batch 9300, loss[loss=0.2406, simple_loss=0.3267, pruned_loss=0.07726, over 21531.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3207, pruned_loss=0.08623, over 4261161.13 frames. ], batch size: 230, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:35:08,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=604764.0, ans=0.125 2023-06-20 17:35:22,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.226e+02 3.268e+02 3.884e+02 4.664e+02 7.347e+02, threshold=7.768e+02, percent-clipped=7.0 2023-06-20 17:36:17,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=15.0 2023-06-20 17:36:18,035 INFO [train.py:996] (0/4) Epoch 4, batch 9350, loss[loss=0.3014, simple_loss=0.3684, pruned_loss=0.1172, over 21416.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3257, pruned_loss=0.08748, over 4267414.23 frames. 
], batch size: 471, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 17:36:30,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=605004.0, ans=0.1 2023-06-20 17:36:33,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=605004.0, ans=0.0 2023-06-20 17:36:59,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.92 vs. limit=6.0 2023-06-20 17:37:13,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=605124.0, ans=0.0 2023-06-20 17:37:25,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=605184.0, ans=0.125 2023-06-20 17:37:38,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=605244.0, ans=0.1 2023-06-20 17:37:42,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-06-20 17:38:01,949 INFO [train.py:996] (0/4) Epoch 4, batch 9400, loss[loss=0.2285, simple_loss=0.2876, pruned_loss=0.08476, over 21348.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3284, pruned_loss=0.08844, over 4268346.89 frames. ], batch size: 160, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:38:35,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.508e+02 2.883e+02 3.302e+02 5.601e+02, threshold=5.767e+02, percent-clipped=0.0 2023-06-20 17:38:46,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=605424.0, ans=0.025 2023-06-20 17:38:50,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-20 17:38:52,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=605484.0, ans=0.025 2023-06-20 17:39:08,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-20 17:39:45,258 INFO [train.py:996] (0/4) Epoch 4, batch 9450, loss[loss=0.2236, simple_loss=0.2862, pruned_loss=0.08053, over 21814.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3194, pruned_loss=0.08695, over 4261663.84 frames. ], batch size: 118, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:40:14,765 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=22.5 2023-06-20 17:40:48,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=605784.0, ans=0.0 2023-06-20 17:40:50,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=605844.0, ans=0.125 2023-06-20 17:41:18,637 INFO [train.py:996] (0/4) Epoch 4, batch 9500, loss[loss=0.2188, simple_loss=0.2819, pruned_loss=0.07781, over 21659.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3111, pruned_loss=0.08465, over 4260239.79 frames. 
], batch size: 333, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:41:20,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=605904.0, ans=0.125 2023-06-20 17:41:41,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-20 17:41:53,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.513e+02 2.842e+02 3.687e+02 6.250e+02, threshold=5.685e+02, percent-clipped=3.0 2023-06-20 17:42:05,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-20 17:43:05,080 INFO [train.py:996] (0/4) Epoch 4, batch 9550, loss[loss=0.3006, simple_loss=0.373, pruned_loss=0.1141, over 21730.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3149, pruned_loss=0.08647, over 4264405.65 frames. ], batch size: 441, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 17:43:08,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=606204.0, ans=0.05 2023-06-20 17:43:55,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=606384.0, ans=0.125 2023-06-20 17:43:59,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=606384.0, ans=0.125 2023-06-20 17:44:16,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=606384.0, ans=0.2 2023-06-20 17:44:49,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=606444.0, ans=0.0 2023-06-20 17:44:55,269 INFO [train.py:996] (0/4) Epoch 4, batch 9600, loss[loss=0.2127, simple_loss=0.2843, pruned_loss=0.07051, over 21819.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3174, pruned_loss=0.08769, over 4272787.67 frames. ], batch size: 247, lr: 8.19e-03, grad_scale: 32.0 2023-06-20 17:45:07,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=606504.0, ans=0.1 2023-06-20 17:45:29,029 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.596e+02 2.958e+02 3.550e+02 7.860e+02, threshold=5.916e+02, percent-clipped=4.0 2023-06-20 17:45:31,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=606624.0, ans=0.0 2023-06-20 17:45:32,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=606624.0, ans=0.2 2023-06-20 17:45:34,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=606624.0, ans=0.05 2023-06-20 17:45:37,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=606624.0, ans=0.0 2023-06-20 17:46:22,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=606744.0, ans=0.1 2023-06-20 17:46:32,678 INFO [train.py:996] (0/4) Epoch 4, batch 9650, loss[loss=0.2433, simple_loss=0.3102, pruned_loss=0.08814, over 21748.00 frames. 
], tot_loss[loss=0.2453, simple_loss=0.316, pruned_loss=0.08734, over 4280876.36 frames. ], batch size: 298, lr: 8.18e-03, grad_scale: 32.0 2023-06-20 17:46:53,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=606864.0, ans=0.125 2023-06-20 17:47:03,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=606864.0, ans=0.2 2023-06-20 17:47:06,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=606924.0, ans=0.125 2023-06-20 17:47:07,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-20 17:47:09,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-20 17:48:18,428 INFO [train.py:996] (0/4) Epoch 4, batch 9700, loss[loss=0.2849, simple_loss=0.3515, pruned_loss=0.1092, over 21535.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3197, pruned_loss=0.08749, over 4280098.70 frames. ], batch size: 471, lr: 8.18e-03, grad_scale: 32.0 2023-06-20 17:48:52,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.445e+02 2.765e+02 3.359e+02 4.931e+02, threshold=5.531e+02, percent-clipped=0.0 2023-06-20 17:48:55,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=607224.0, ans=0.0 2023-06-20 17:50:06,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=607404.0, ans=0.2 2023-06-20 17:50:13,092 INFO [train.py:996] (0/4) Epoch 4, batch 9750, loss[loss=0.2119, simple_loss=0.2472, pruned_loss=0.08831, over 20030.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3134, pruned_loss=0.08643, over 4278099.11 frames. ], batch size: 703, lr: 8.18e-03, grad_scale: 32.0 2023-06-20 17:50:52,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=607524.0, ans=0.1 2023-06-20 17:51:43,075 INFO [train.py:996] (0/4) Epoch 4, batch 9800, loss[loss=0.2254, simple_loss=0.2931, pruned_loss=0.07882, over 21610.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3133, pruned_loss=0.08685, over 4274591.23 frames. ], batch size: 263, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 17:51:45,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-20 17:52:18,895 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.616e+02 2.976e+02 3.497e+02 5.489e+02, threshold=5.952e+02, percent-clipped=0.0 2023-06-20 17:52:31,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=607824.0, ans=0.125 2023-06-20 17:53:19,582 INFO [train.py:996] (0/4) Epoch 4, batch 9850, loss[loss=0.2285, simple_loss=0.2871, pruned_loss=0.08493, over 21567.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3114, pruned_loss=0.08722, over 4259632.97 frames. 
], batch size: 391, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 17:53:20,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=608004.0, ans=0.125 2023-06-20 17:53:57,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=608124.0, ans=0.125 2023-06-20 17:54:01,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=608124.0, ans=0.125 2023-06-20 17:55:00,052 INFO [train.py:996] (0/4) Epoch 4, batch 9900, loss[loss=0.2254, simple_loss=0.284, pruned_loss=0.08347, over 15440.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3078, pruned_loss=0.08715, over 4239340.09 frames. ], batch size: 61, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 17:55:34,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.08 vs. limit=12.0 2023-06-20 17:55:38,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.623e+02 2.955e+02 3.593e+02 6.748e+02, threshold=5.910e+02, percent-clipped=3.0 2023-06-20 17:55:55,848 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-20 17:56:42,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=608544.0, ans=0.125 2023-06-20 17:56:45,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-20 17:56:45,772 INFO [train.py:996] (0/4) Epoch 4, batch 9950, loss[loss=0.2242, simple_loss=0.2758, pruned_loss=0.08627, over 21390.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.311, pruned_loss=0.08928, over 4249784.95 frames. ], batch size: 211, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 17:56:46,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=608604.0, ans=0.2 2023-06-20 17:56:54,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=608604.0, ans=0.1 2023-06-20 17:57:09,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=608664.0, ans=0.1 2023-06-20 17:57:11,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=608664.0, ans=0.125 2023-06-20 17:57:28,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=608724.0, ans=0.0 2023-06-20 17:58:36,072 INFO [train.py:996] (0/4) Epoch 4, batch 10000, loss[loss=0.2111, simple_loss=0.2816, pruned_loss=0.07027, over 21644.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3058, pruned_loss=0.08699, over 4244441.23 frames. 
], batch size: 391, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 17:59:06,186 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.693e+02 3.176e+02 3.942e+02 5.803e+02, threshold=6.352e+02, percent-clipped=0.0 2023-06-20 17:59:10,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=609024.0, ans=0.0 2023-06-20 17:59:26,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-20 18:00:26,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.76 vs. limit=15.0 2023-06-20 18:00:30,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=609144.0, ans=0.0 2023-06-20 18:00:33,093 INFO [train.py:996] (0/4) Epoch 4, batch 10050, loss[loss=0.1847, simple_loss=0.2632, pruned_loss=0.0531, over 20748.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.307, pruned_loss=0.08739, over 4253912.21 frames. ], batch size: 608, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 18:00:35,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=609204.0, ans=0.1 2023-06-20 18:00:52,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=609264.0, ans=0.0 2023-06-20 18:01:34,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=12.0 2023-06-20 18:01:59,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=609444.0, ans=0.07 2023-06-20 18:02:11,240 INFO [train.py:996] (0/4) Epoch 4, batch 10100, loss[loss=0.2418, simple_loss=0.3213, pruned_loss=0.08115, over 21915.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3032, pruned_loss=0.08415, over 4257565.94 frames. ], batch size: 316, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 18:02:48,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.616e+02 3.061e+02 3.552e+02 5.046e+02, threshold=6.121e+02, percent-clipped=0.0 2023-06-20 18:03:04,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=609624.0, ans=0.125 2023-06-20 18:03:12,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=609624.0, ans=0.125 2023-06-20 18:03:36,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=609684.0, ans=0.125 2023-06-20 18:03:49,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=609744.0, ans=0.2 2023-06-20 18:03:56,768 INFO [train.py:996] (0/4) Epoch 4, batch 10150, loss[loss=0.2568, simple_loss=0.3263, pruned_loss=0.09364, over 21668.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3104, pruned_loss=0.08756, over 4260536.09 frames. 
], batch size: 332, lr: 8.16e-03, grad_scale: 32.0 2023-06-20 18:04:06,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=609804.0, ans=0.0 2023-06-20 18:04:59,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-20 18:05:24,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=610044.0, ans=0.0 2023-06-20 18:05:34,319 INFO [train.py:996] (0/4) Epoch 4, batch 10200, loss[loss=0.2117, simple_loss=0.2937, pruned_loss=0.06486, over 21747.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3098, pruned_loss=0.08513, over 4258303.84 frames. ], batch size: 282, lr: 8.16e-03, grad_scale: 32.0 2023-06-20 18:05:34,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=610104.0, ans=0.125 2023-06-20 18:05:36,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=610104.0, ans=0.125 2023-06-20 18:05:46,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=610104.0, ans=0.0 2023-06-20 18:05:54,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=610164.0, ans=0.0 2023-06-20 18:06:09,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.303e+02 2.650e+02 3.100e+02 6.273e+02, threshold=5.301e+02, percent-clipped=1.0 2023-06-20 18:06:10,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=610224.0, ans=0.2 2023-06-20 18:06:26,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=610224.0, ans=0.125 2023-06-20 18:06:36,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=610284.0, ans=0.125 2023-06-20 18:07:06,499 INFO [train.py:996] (0/4) Epoch 4, batch 10250, loss[loss=0.2449, simple_loss=0.3253, pruned_loss=0.08225, over 21620.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3068, pruned_loss=0.08121, over 4246085.60 frames. ], batch size: 389, lr: 8.16e-03, grad_scale: 32.0 2023-06-20 18:07:08,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=610404.0, ans=0.0 2023-06-20 18:07:11,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=610404.0, ans=0.125 2023-06-20 18:07:16,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-20 18:08:59,163 INFO [train.py:996] (0/4) Epoch 4, batch 10300, loss[loss=0.264, simple_loss=0.3458, pruned_loss=0.09108, over 21427.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3085, pruned_loss=0.08108, over 4255737.99 frames. 
], batch size: 131, lr: 8.16e-03, grad_scale: 32.0 2023-06-20 18:09:42,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=610764.0, ans=0.1 2023-06-20 18:10:07,761 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 2.456e+02 2.901e+02 3.440e+02 5.624e+02, threshold=5.802e+02, percent-clipped=3.0 2023-06-20 18:10:11,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=610824.0, ans=0.125 2023-06-20 18:10:35,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=610884.0, ans=0.125 2023-06-20 18:10:39,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=610944.0, ans=0.125 2023-06-20 18:11:07,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=610944.0, ans=0.0 2023-06-20 18:11:08,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.47 vs. limit=15.0 2023-06-20 18:11:11,074 INFO [train.py:996] (0/4) Epoch 4, batch 10350, loss[loss=0.2321, simple_loss=0.3072, pruned_loss=0.07847, over 21890.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3112, pruned_loss=0.08161, over 4255437.58 frames. ], batch size: 373, lr: 8.16e-03, grad_scale: 32.0 2023-06-20 18:11:50,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=611064.0, ans=0.125 2023-06-20 18:11:55,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=611064.0, ans=0.125 2023-06-20 18:11:55,749 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:12:17,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=611184.0, ans=0.07 2023-06-20 18:12:19,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=611184.0, ans=0.125 2023-06-20 18:12:21,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=611184.0, ans=0.1 2023-06-20 18:12:28,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=611244.0, ans=0.0 2023-06-20 18:12:54,575 INFO [train.py:996] (0/4) Epoch 4, batch 10400, loss[loss=0.2428, simple_loss=0.3105, pruned_loss=0.08751, over 21892.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3031, pruned_loss=0.07976, over 4255962.23 frames. 
], batch size: 373, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 18:13:12,910 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:13:27,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=611364.0, ans=0.1 2023-06-20 18:13:36,681 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.614e+02 3.049e+02 3.745e+02 5.860e+02, threshold=6.098e+02, percent-clipped=1.0 2023-06-20 18:13:38,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=611424.0, ans=0.125 2023-06-20 18:13:39,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.37 vs. limit=15.0 2023-06-20 18:14:10,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=611484.0, ans=0.1 2023-06-20 18:14:33,849 INFO [train.py:996] (0/4) Epoch 4, batch 10450, loss[loss=0.232, simple_loss=0.3109, pruned_loss=0.07656, over 21401.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3079, pruned_loss=0.08334, over 4256558.72 frames. ], batch size: 211, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 18:14:34,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=611604.0, ans=0.125 2023-06-20 18:16:43,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=611844.0, ans=0.05 2023-06-20 18:16:49,365 INFO [train.py:996] (0/4) Epoch 4, batch 10500, loss[loss=0.2327, simple_loss=0.2953, pruned_loss=0.08501, over 21223.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3076, pruned_loss=0.08145, over 4250097.71 frames. ], batch size: 176, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 18:17:15,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=611964.0, ans=0.07 2023-06-20 18:17:25,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.554e+02 2.960e+02 3.444e+02 4.861e+02, threshold=5.921e+02, percent-clipped=0.0 2023-06-20 18:17:45,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=612084.0, ans=0.0 2023-06-20 18:17:48,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=612084.0, ans=12.0 2023-06-20 18:18:23,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=612144.0, ans=0.0 2023-06-20 18:18:27,192 INFO [train.py:996] (0/4) Epoch 4, batch 10550, loss[loss=0.1644, simple_loss=0.2322, pruned_loss=0.04827, over 15213.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3032, pruned_loss=0.08137, over 4235202.44 frames. 
], batch size: 60, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 18:18:30,803 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:19:05,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=612324.0, ans=0.0 2023-06-20 18:19:11,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=612324.0, ans=0.2 2023-06-20 18:19:14,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=612324.0, ans=0.125 2023-06-20 18:19:24,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-20 18:19:28,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. limit=12.0 2023-06-20 18:19:55,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=612444.0, ans=0.0 2023-06-20 18:20:06,500 INFO [train.py:996] (0/4) Epoch 4, batch 10600, loss[loss=0.1879, simple_loss=0.2601, pruned_loss=0.0578, over 21299.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3, pruned_loss=0.08058, over 4242194.11 frames. ], batch size: 131, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 18:20:26,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=612564.0, ans=0.0 2023-06-20 18:20:40,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=612564.0, ans=0.125 2023-06-20 18:20:42,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.468e+02 2.792e+02 3.375e+02 4.680e+02, threshold=5.585e+02, percent-clipped=0.0 2023-06-20 18:21:09,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=612684.0, ans=0.125 2023-06-20 18:21:14,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=612684.0, ans=0.0 2023-06-20 18:21:51,934 INFO [train.py:996] (0/4) Epoch 4, batch 10650, loss[loss=0.1754, simple_loss=0.2525, pruned_loss=0.04914, over 21394.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3, pruned_loss=0.07871, over 4251016.80 frames. ], batch size: 211, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 18:22:18,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=612864.0, ans=0.125 2023-06-20 18:23:01,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=612984.0, ans=0.1 2023-06-20 18:23:05,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=612984.0, ans=0.1 2023-06-20 18:23:43,990 INFO [train.py:996] (0/4) Epoch 4, batch 10700, loss[loss=0.3122, simple_loss=0.3684, pruned_loss=0.128, over 21415.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3017, pruned_loss=0.07973, over 4249678.06 frames. 
], batch size: 471, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 18:24:11,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=613164.0, ans=0.125 2023-06-20 18:24:25,580 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.621e+02 3.333e+02 3.923e+02 6.693e+02, threshold=6.666e+02, percent-clipped=4.0 2023-06-20 18:25:01,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=613344.0, ans=0.0 2023-06-20 18:25:09,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=613344.0, ans=0.125 2023-06-20 18:25:10,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=613344.0, ans=0.025 2023-06-20 18:25:22,499 INFO [train.py:996] (0/4) Epoch 4, batch 10750, loss[loss=0.2554, simple_loss=0.3542, pruned_loss=0.07829, over 21763.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3103, pruned_loss=0.08364, over 4257614.83 frames. ], batch size: 351, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 18:27:06,305 INFO [train.py:996] (0/4) Epoch 4, batch 10800, loss[loss=0.2749, simple_loss=0.3394, pruned_loss=0.1052, over 21387.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3164, pruned_loss=0.0848, over 4258136.95 frames. ], batch size: 211, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 18:28:07,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.553e+02 2.986e+02 3.356e+02 5.834e+02, threshold=5.972e+02, percent-clipped=0.0 2023-06-20 18:29:04,045 INFO [train.py:996] (0/4) Epoch 4, batch 10850, loss[loss=0.2306, simple_loss=0.3053, pruned_loss=0.07799, over 21776.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.317, pruned_loss=0.08534, over 4264641.86 frames. ], batch size: 352, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 18:29:16,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.83 vs. limit=10.0 2023-06-20 18:29:24,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=614064.0, ans=0.0 2023-06-20 18:29:26,093 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=22.5 2023-06-20 18:30:06,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-20 18:30:08,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=614184.0, ans=0.125 2023-06-20 18:30:11,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=614184.0, ans=0.0 2023-06-20 18:30:37,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=614244.0, ans=0.1 2023-06-20 18:30:41,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=614304.0, ans=0.125 2023-06-20 18:30:48,127 INFO [train.py:996] (0/4) Epoch 4, batch 10900, loss[loss=0.2244, simple_loss=0.3173, pruned_loss=0.06578, over 21690.00 frames. 
], tot_loss[loss=0.2376, simple_loss=0.3094, pruned_loss=0.08285, over 4268828.55 frames. ], batch size: 298, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 18:30:53,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=614304.0, ans=0.0 2023-06-20 18:31:08,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=614364.0, ans=0.125 2023-06-20 18:31:33,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.427e+02 2.861e+02 3.280e+02 5.229e+02, threshold=5.723e+02, percent-clipped=0.0 2023-06-20 18:31:46,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=614424.0, ans=0.125 2023-06-20 18:31:52,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0 2023-06-20 18:32:30,223 INFO [train.py:996] (0/4) Epoch 4, batch 10950, loss[loss=0.2282, simple_loss=0.2917, pruned_loss=0.08236, over 21885.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3052, pruned_loss=0.08081, over 4271302.74 frames. ], batch size: 107, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 18:32:30,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=614604.0, ans=0.125 2023-06-20 18:32:54,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=614664.0, ans=0.125 2023-06-20 18:33:11,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=614724.0, ans=0.0 2023-06-20 18:33:16,329 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-20 18:33:31,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=614784.0, ans=0.0 2023-06-20 18:33:35,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=614784.0, ans=0.2 2023-06-20 18:33:40,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=22.5 2023-06-20 18:34:07,234 INFO [train.py:996] (0/4) Epoch 4, batch 11000, loss[loss=0.2248, simple_loss=0.2873, pruned_loss=0.08115, over 21688.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3031, pruned_loss=0.08175, over 4270093.16 frames. ], batch size: 230, lr: 8.13e-03, grad_scale: 16.0 2023-06-20 18:34:25,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.76 vs. 
limit=22.5 2023-06-20 18:35:02,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.436e+02 2.740e+02 3.123e+02 4.405e+02, threshold=5.481e+02, percent-clipped=0.0 2023-06-20 18:35:09,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=615024.0, ans=0.09899494936611666 2023-06-20 18:35:21,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=615084.0, ans=0.0 2023-06-20 18:35:45,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=22.5 2023-06-20 18:35:56,964 INFO [train.py:996] (0/4) Epoch 4, batch 11050, loss[loss=0.2058, simple_loss=0.2698, pruned_loss=0.0709, over 21613.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3015, pruned_loss=0.08319, over 4264083.40 frames. ], batch size: 264, lr: 8.13e-03, grad_scale: 16.0 2023-06-20 18:35:58,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=615204.0, ans=0.1 2023-06-20 18:36:00,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=615204.0, ans=0.2 2023-06-20 18:36:14,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-20 18:37:13,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-20 18:37:33,927 INFO [train.py:996] (0/4) Epoch 4, batch 11100, loss[loss=0.2285, simple_loss=0.2965, pruned_loss=0.08022, over 21414.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3026, pruned_loss=0.08345, over 4253132.37 frames. ], batch size: 194, lr: 8.13e-03, grad_scale: 16.0 2023-06-20 18:38:10,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=615624.0, ans=0.0 2023-06-20 18:38:11,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.701e+02 3.050e+02 3.889e+02 7.267e+02, threshold=6.099e+02, percent-clipped=1.0 2023-06-20 18:38:26,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=615624.0, ans=0.1 2023-06-20 18:38:35,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=615684.0, ans=0.125 2023-06-20 18:38:41,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=615684.0, ans=0.05 2023-06-20 18:39:11,206 INFO [train.py:996] (0/4) Epoch 4, batch 11150, loss[loss=0.2676, simple_loss=0.3194, pruned_loss=0.1079, over 21298.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.301, pruned_loss=0.08292, over 4256562.93 frames. 
], batch size: 471, lr: 8.12e-03, grad_scale: 16.0 2023-06-20 18:39:20,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=615804.0, ans=0.1 2023-06-20 18:39:47,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=615924.0, ans=0.125 2023-06-20 18:40:10,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=615984.0, ans=0.1 2023-06-20 18:40:43,580 INFO [train.py:996] (0/4) Epoch 4, batch 11200, loss[loss=0.1986, simple_loss=0.2717, pruned_loss=0.06277, over 21648.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.2983, pruned_loss=0.08169, over 4250343.69 frames. ], batch size: 282, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 18:41:20,884 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.501e+02 3.007e+02 3.463e+02 5.262e+02, threshold=6.015e+02, percent-clipped=0.0 2023-06-20 18:41:22,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=616224.0, ans=0.07 2023-06-20 18:41:47,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=616284.0, ans=0.2 2023-06-20 18:41:50,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=616284.0, ans=0.09899494936611666 2023-06-20 18:41:51,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=616284.0, ans=0.125 2023-06-20 18:42:00,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=616344.0, ans=0.125 2023-06-20 18:42:19,908 INFO [train.py:996] (0/4) Epoch 4, batch 11250, loss[loss=0.2432, simple_loss=0.3174, pruned_loss=0.08446, over 21798.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2982, pruned_loss=0.082, over 4256961.37 frames. ], batch size: 118, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 18:42:30,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2023-06-20 18:42:39,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=616464.0, ans=0.125 2023-06-20 18:42:42,548 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-20 18:43:04,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=616524.0, ans=0.125 2023-06-20 18:43:11,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-20 18:43:15,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=616584.0, ans=0.0 2023-06-20 18:43:29,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. 
limit=6.0 2023-06-20 18:43:55,851 INFO [train.py:996] (0/4) Epoch 4, batch 11300, loss[loss=0.2243, simple_loss=0.3012, pruned_loss=0.07368, over 21981.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2998, pruned_loss=0.08266, over 4254272.72 frames. ], batch size: 373, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 18:44:17,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=616764.0, ans=0.0 2023-06-20 18:44:32,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 2.391e+02 2.715e+02 3.266e+02 4.844e+02, threshold=5.429e+02, percent-clipped=0.0 2023-06-20 18:44:56,323 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.82 vs. limit=15.0 2023-06-20 18:44:58,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=616884.0, ans=0.0 2023-06-20 18:45:06,820 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-20 18:45:08,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-20 18:45:32,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=617004.0, ans=0.1 2023-06-20 18:45:33,081 INFO [train.py:996] (0/4) Epoch 4, batch 11350, loss[loss=0.318, simple_loss=0.3772, pruned_loss=0.1294, over 21714.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3017, pruned_loss=0.08229, over 4263990.86 frames. ], batch size: 441, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 18:47:05,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=617184.0, ans=0.125 2023-06-20 18:47:20,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=617244.0, ans=0.1 2023-06-20 18:47:26,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=617304.0, ans=0.125 2023-06-20 18:47:27,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-20 18:47:27,858 INFO [train.py:996] (0/4) Epoch 4, batch 11400, loss[loss=0.2318, simple_loss=0.3221, pruned_loss=0.07081, over 21749.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3071, pruned_loss=0.08455, over 4268488.79 frames. ], batch size: 332, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 18:48:05,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-20 18:48:05,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=617364.0, ans=0.125 2023-06-20 18:48:09,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.547e+02 3.001e+02 3.571e+02 5.767e+02, threshold=6.003e+02, percent-clipped=1.0 2023-06-20 18:48:36,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.22 vs. 
limit=15.0 2023-06-20 18:49:00,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0 2023-06-20 18:49:01,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=617484.0, ans=0.0 2023-06-20 18:49:20,891 INFO [train.py:996] (0/4) Epoch 4, batch 11450, loss[loss=0.2439, simple_loss=0.324, pruned_loss=0.08186, over 21464.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.307, pruned_loss=0.08314, over 4266989.49 frames. ], batch size: 211, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 18:49:31,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=617604.0, ans=0.0 2023-06-20 18:50:04,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=617724.0, ans=0.0 2023-06-20 18:50:32,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=617784.0, ans=0.04949747468305833 2023-06-20 18:50:35,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=617784.0, ans=0.125 2023-06-20 18:50:48,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=617844.0, ans=0.125 2023-06-20 18:51:09,829 INFO [train.py:996] (0/4) Epoch 4, batch 11500, loss[loss=0.2214, simple_loss=0.3172, pruned_loss=0.06283, over 21804.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3095, pruned_loss=0.08386, over 4270504.95 frames. ], batch size: 282, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 18:51:40,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=617964.0, ans=0.125 2023-06-20 18:51:40,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=617964.0, ans=0.0 2023-06-20 18:51:54,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.475e+02 2.866e+02 3.336e+02 5.251e+02, threshold=5.732e+02, percent-clipped=0.0 2023-06-20 18:52:58,001 INFO [train.py:996] (0/4) Epoch 4, batch 11550, loss[loss=0.2732, simple_loss=0.3663, pruned_loss=0.09004, over 21735.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.316, pruned_loss=0.08352, over 4275397.77 frames. ], batch size: 351, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 18:53:41,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=618264.0, ans=0.05 2023-06-20 18:54:05,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=618324.0, ans=0.0 2023-06-20 18:54:56,645 INFO [train.py:996] (0/4) Epoch 4, batch 11600, loss[loss=0.2794, simple_loss=0.3784, pruned_loss=0.09023, over 21663.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3315, pruned_loss=0.08584, over 4273584.57 frames. 
], batch size: 247, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 18:54:58,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=618504.0, ans=0.0 2023-06-20 18:55:32,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=618564.0, ans=0.125 2023-06-20 18:55:40,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.611e+02 3.024e+02 3.614e+02 6.438e+02, threshold=6.048e+02, percent-clipped=2.0 2023-06-20 18:55:40,895 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:56:08,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=618684.0, ans=0.0 2023-06-20 18:56:34,569 INFO [train.py:996] (0/4) Epoch 4, batch 11650, loss[loss=0.3178, simple_loss=0.4151, pruned_loss=0.1103, over 21653.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3379, pruned_loss=0.08617, over 4278965.45 frames. ], batch size: 414, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 18:56:41,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=618804.0, ans=0.125 2023-06-20 18:57:12,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=618864.0, ans=0.1 2023-06-20 18:57:36,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=618984.0, ans=0.0 2023-06-20 18:57:58,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-20 18:58:01,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=619044.0, ans=0.125 2023-06-20 18:58:08,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=619044.0, ans=0.2 2023-06-20 18:58:10,737 INFO [train.py:996] (0/4) Epoch 4, batch 11700, loss[loss=0.2192, simple_loss=0.2765, pruned_loss=0.08089, over 21755.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3281, pruned_loss=0.08626, over 4282889.43 frames. ], batch size: 102, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 18:58:39,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=619164.0, ans=15.0 2023-06-20 18:58:52,834 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.548e+02 2.868e+02 3.351e+02 5.665e+02, threshold=5.736e+02, percent-clipped=0.0 2023-06-20 18:59:29,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-20 18:59:30,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=619284.0, ans=0.125 2023-06-20 18:59:45,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=619344.0, ans=0.125 2023-06-20 18:59:56,746 INFO [train.py:996] (0/4) Epoch 4, batch 11750, loss[loss=0.2501, simple_loss=0.3055, pruned_loss=0.09732, over 21276.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3196, pruned_loss=0.0867, over 4282092.87 frames. 
], batch size: 176, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 19:00:27,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=619464.0, ans=0.2 2023-06-20 19:01:31,448 INFO [train.py:996] (0/4) Epoch 4, batch 11800, loss[loss=0.2895, simple_loss=0.3532, pruned_loss=0.1129, over 21416.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3227, pruned_loss=0.08911, over 4275439.46 frames. ], batch size: 159, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 19:02:13,018 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=22.5 2023-06-20 19:02:13,523 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.675e+02 3.120e+02 4.087e+02 6.326e+02, threshold=6.239e+02, percent-clipped=4.0 2023-06-20 19:02:24,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=619824.0, ans=0.07 2023-06-20 19:03:02,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=619944.0, ans=0.125 2023-06-20 19:03:05,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=619944.0, ans=0.0 2023-06-20 19:03:08,262 INFO [train.py:996] (0/4) Epoch 4, batch 11850, loss[loss=0.2453, simple_loss=0.329, pruned_loss=0.08084, over 21461.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3233, pruned_loss=0.08749, over 4277613.21 frames. ], batch size: 548, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 19:03:17,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=620004.0, ans=0.1 2023-06-20 19:03:38,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=620064.0, ans=0.0 2023-06-20 19:04:10,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=15.0 2023-06-20 19:04:25,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=620184.0, ans=0.0 2023-06-20 19:04:31,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=620244.0, ans=0.2 2023-06-20 19:04:58,308 INFO [train.py:996] (0/4) Epoch 4, batch 11900, loss[loss=0.2523, simple_loss=0.353, pruned_loss=0.07583, over 21647.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.324, pruned_loss=0.08583, over 4277795.77 frames. 
], batch size: 441, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:05:09,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=620304.0, ans=0.125 2023-06-20 19:05:25,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=620364.0, ans=0.125 2023-06-20 19:05:36,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 2.326e+02 2.650e+02 3.050e+02 4.543e+02, threshold=5.300e+02, percent-clipped=0.0 2023-06-20 19:05:45,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=620424.0, ans=0.0 2023-06-20 19:05:50,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=620424.0, ans=0.125 2023-06-20 19:05:58,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=620484.0, ans=0.125 2023-06-20 19:06:10,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=620544.0, ans=0.1 2023-06-20 19:06:10,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=620544.0, ans=0.0 2023-06-20 19:06:37,623 INFO [train.py:996] (0/4) Epoch 4, batch 11950, loss[loss=0.217, simple_loss=0.3328, pruned_loss=0.05058, over 19877.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3242, pruned_loss=0.08343, over 4266231.83 frames. ], batch size: 702, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:06:45,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=620604.0, ans=0.0 2023-06-20 19:06:51,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=620664.0, ans=0.125 2023-06-20 19:07:01,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=620664.0, ans=0.05 2023-06-20 19:07:07,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=620664.0, ans=0.1 2023-06-20 19:07:08,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-20 19:07:24,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.89 vs. limit=15.0 2023-06-20 19:08:06,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=620844.0, ans=0.2 2023-06-20 19:08:14,762 INFO [train.py:996] (0/4) Epoch 4, batch 12000, loss[loss=0.211, simple_loss=0.2702, pruned_loss=0.07594, over 21745.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3198, pruned_loss=0.08143, over 4261021.60 frames. ], batch size: 124, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:08:14,763 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 19:09:03,789 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2647, simple_loss=0.362, pruned_loss=0.08364, over 1796401.00 frames. 
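The optim.py entries in this log ("Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=...") print the 0/25/50/75/100 percentiles of recent gradient norms; in every such entry the threshold equals Clipping_scale times the logged median (e.g. 5.300e+02 = 2.0 × 2.650e+02 just above), and percent-clipped is the share of recent batches whose norm exceeded that threshold. A minimal sketch of that bookkeeping, assuming a sliding window of recent norms (the class name and window size below are illustrative, not icefall's actual optim.py):

```python
# Illustrative sketch (not icefall's optim.py): quartile-based gradient
# clipping consistent with the "grad-norm quartiles ... threshold ...
# percent-clipped" lines in this log.
from collections import deque


class QuartileClipper:
    def __init__(self, clipping_scale=2.0, window=50):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent total grad norms
        self.num_seen = 0
        self.num_clipped = 0

    def quartiles(self):
        # 0%, 25%, 50%, 75%, 100% points of the recent-norm window,
        # matching the five numbers printed per log entry.
        xs = sorted(self.norms)
        return [xs[int(q * (len(xs) - 1))] for q in (0.0, 0.25, 0.5, 0.75, 1.0)]

    def clip(self, grad_norm):
        """Return the scale (<= 1.0) to multiply this batch's gradients by."""
        self.norms.append(grad_norm)
        median = self.quartiles()[2]
        threshold = self.clipping_scale * median  # e.g. 2.0 * 2.650e+02 = 5.300e+02
        self.num_seen += 1
        if grad_norm > threshold:
            self.num_clipped += 1
            return threshold / grad_norm
        return 1.0

    def percent_clipped(self):
        return 100.0 * self.num_clipped / max(1, self.num_seen)
```

On this reading, a high percent-clipped figure (such as the 11.0 logged at 19:09:47 below) flags a burst of gradients that were unusually large relative to the recent median.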
2023-06-20 19:09:03,790 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-20 19:09:32,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=620964.0, ans=0.125 2023-06-20 19:09:40,631 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-20 19:09:40,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=15.0 2023-06-20 19:09:47,059 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.456e+02 3.065e+02 4.141e+02 8.942e+02, threshold=6.129e+02, percent-clipped=11.0 2023-06-20 19:10:42,509 INFO [train.py:996] (0/4) Epoch 4, batch 12050, loss[loss=0.2816, simple_loss=0.348, pruned_loss=0.1076, over 21870.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3143, pruned_loss=0.08236, over 4267508.12 frames. ], batch size: 118, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:10:55,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=621204.0, ans=0.125 2023-06-20 19:11:21,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=621264.0, ans=0.125 2023-06-20 19:11:38,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=621324.0, ans=0.0 2023-06-20 19:11:54,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=621384.0, ans=0.125 2023-06-20 19:12:15,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=621444.0, ans=0.125 2023-06-20 19:12:21,129 INFO [train.py:996] (0/4) Epoch 4, batch 12100, loss[loss=0.2978, simple_loss=0.3596, pruned_loss=0.118, over 21397.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3193, pruned_loss=0.08633, over 4275347.91 frames. ], batch size: 548, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:12:40,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=621504.0, ans=0.1 2023-06-20 19:12:47,105 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.08 vs. limit=10.0 2023-06-20 19:13:09,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.714e+02 2.937e+02 3.590e+02 6.628e+02, threshold=5.874e+02, percent-clipped=1.0 2023-06-20 19:13:22,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=621624.0, ans=0.2 2023-06-20 19:13:33,801 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-20 19:14:10,379 INFO [train.py:996] (0/4) Epoch 4, batch 12150, loss[loss=0.2489, simple_loss=0.3421, pruned_loss=0.07781, over 21782.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3221, pruned_loss=0.08616, over 4271339.10 frames. 
], batch size: 332, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 19:14:16,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=621804.0, ans=0.0 2023-06-20 19:14:20,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=621804.0, ans=0.125 2023-06-20 19:15:32,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=622044.0, ans=0.0 2023-06-20 19:15:47,743 INFO [train.py:996] (0/4) Epoch 4, batch 12200, loss[loss=0.2119, simple_loss=0.2716, pruned_loss=0.07608, over 21492.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3206, pruned_loss=0.08611, over 4274010.09 frames. ], batch size: 212, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 19:15:52,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=622104.0, ans=0.125 2023-06-20 19:15:54,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.12 vs. limit=12.0 2023-06-20 19:15:58,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=622104.0, ans=0.2 2023-06-20 19:16:20,486 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:16:30,793 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.592e+02 3.097e+02 3.968e+02 7.788e+02, threshold=6.193e+02, percent-clipped=3.0 2023-06-20 19:16:52,903 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:16:54,710 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-20 19:17:15,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.17 vs. limit=10.0 2023-06-20 19:17:18,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=622344.0, ans=0.2 2023-06-20 19:17:22,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=622344.0, ans=0.0 2023-06-20 19:17:25,142 INFO [train.py:996] (0/4) Epoch 4, batch 12250, loss[loss=0.1839, simple_loss=0.2592, pruned_loss=0.05427, over 21525.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3128, pruned_loss=0.08267, over 4274810.53 frames. ], batch size: 230, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 19:17:36,835 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.37 vs. 
limit=15.0 2023-06-20 19:18:00,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=622464.0, ans=0.1 2023-06-20 19:18:27,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=622584.0, ans=0.0 2023-06-20 19:18:33,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=622584.0, ans=0.125 2023-06-20 19:19:00,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=622644.0, ans=0.2 2023-06-20 19:19:02,396 INFO [train.py:996] (0/4) Epoch 4, batch 12300, loss[loss=0.243, simple_loss=0.3329, pruned_loss=0.07655, over 21746.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.304, pruned_loss=0.07645, over 4281770.02 frames. ], batch size: 332, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 19:19:26,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=622764.0, ans=0.0 2023-06-20 19:19:45,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 2.123e+02 2.583e+02 3.049e+02 4.453e+02, threshold=5.165e+02, percent-clipped=0.0 2023-06-20 19:19:47,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=622824.0, ans=0.125 2023-06-20 19:19:53,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=622824.0, ans=0.125 2023-06-20 19:19:55,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=622824.0, ans=0.0 2023-06-20 19:20:01,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=622884.0, ans=0.0 2023-06-20 19:20:09,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=622884.0, ans=0.125 2023-06-20 19:20:42,916 INFO [train.py:996] (0/4) Epoch 4, batch 12350, loss[loss=0.2577, simple_loss=0.3298, pruned_loss=0.09279, over 21869.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3089, pruned_loss=0.0777, over 4282373.10 frames. ], batch size: 107, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 19:20:55,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=623004.0, ans=0.125 2023-06-20 19:21:04,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=623064.0, ans=0.125 2023-06-20 19:21:17,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=623064.0, ans=0.04949747468305833 2023-06-20 19:21:39,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=623184.0, ans=10.0 2023-06-20 19:21:54,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=623244.0, ans=0.125 2023-06-20 19:21:56,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.09 vs. 
limit=22.5 2023-06-20 19:22:18,920 INFO [train.py:996] (0/4) Epoch 4, batch 12400, loss[loss=0.2533, simple_loss=0.3076, pruned_loss=0.09952, over 21312.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3107, pruned_loss=0.08118, over 4288332.87 frames. ], batch size: 159, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 19:22:58,385 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.37 vs. limit=15.0 2023-06-20 19:23:01,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.547e+02 2.868e+02 3.412e+02 7.340e+02, threshold=5.736e+02, percent-clipped=3.0 2023-06-20 19:23:19,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=623484.0, ans=0.125 2023-06-20 19:23:57,376 INFO [train.py:996] (0/4) Epoch 4, batch 12450, loss[loss=0.2906, simple_loss=0.359, pruned_loss=0.1112, over 21305.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3147, pruned_loss=0.0845, over 4283470.51 frames. ], batch size: 143, lr: 8.07e-03, grad_scale: 32.0 2023-06-20 19:24:21,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=623604.0, ans=0.1 2023-06-20 19:24:32,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=623664.0, ans=0.0 2023-06-20 19:24:56,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=623724.0, ans=0.0 2023-06-20 19:25:13,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=623784.0, ans=0.125 2023-06-20 19:25:16,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2023-06-20 19:25:47,473 INFO [train.py:996] (0/4) Epoch 4, batch 12500, loss[loss=0.2244, simple_loss=0.2618, pruned_loss=0.09348, over 20179.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3261, pruned_loss=0.08938, over 4284904.74 frames. ], batch size: 703, lr: 8.07e-03, grad_scale: 32.0 2023-06-20 19:25:48,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=623904.0, ans=0.0 2023-06-20 19:26:10,168 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-104000.pt 2023-06-20 19:26:14,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=623964.0, ans=0.125 2023-06-20 19:26:18,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=623964.0, ans=0.125 2023-06-20 19:26:28,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.938e+02 3.235e+02 3.820e+02 6.603e+02, threshold=6.470e+02, percent-clipped=1.0 2023-06-20 19:26:44,481 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. 
limit=15.0 2023-06-20 19:26:46,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=624084.0, ans=0.0 2023-06-20 19:27:37,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=624144.0, ans=10.0 2023-06-20 19:27:43,432 INFO [train.py:996] (0/4) Epoch 4, batch 12550, loss[loss=0.2814, simple_loss=0.3607, pruned_loss=0.1011, over 21725.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.333, pruned_loss=0.09209, over 4288712.10 frames. ], batch size: 441, lr: 8.07e-03, grad_scale: 32.0 2023-06-20 19:28:19,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=624264.0, ans=0.025 2023-06-20 19:28:33,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=624324.0, ans=0.1 2023-06-20 19:28:39,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=624324.0, ans=0.0 2023-06-20 19:29:24,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-20 19:29:38,571 INFO [train.py:996] (0/4) Epoch 4, batch 12600, loss[loss=0.1821, simple_loss=0.2509, pruned_loss=0.05666, over 21863.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3305, pruned_loss=0.08902, over 4281011.86 frames. ], batch size: 107, lr: 8.07e-03, grad_scale: 32.0 2023-06-20 19:29:43,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=624504.0, ans=0.2 2023-06-20 19:30:01,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=624564.0, ans=0.95 2023-06-20 19:30:26,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=624624.0, ans=0.1 2023-06-20 19:30:29,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.456e+02 2.778e+02 3.111e+02 4.488e+02, threshold=5.555e+02, percent-clipped=0.0 2023-06-20 19:30:36,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=624624.0, ans=0.0 2023-06-20 19:31:07,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.18 vs. limit=6.0 2023-06-20 19:31:23,385 INFO [train.py:996] (0/4) Epoch 4, batch 12650, loss[loss=0.2323, simple_loss=0.2978, pruned_loss=0.08343, over 21793.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3222, pruned_loss=0.08475, over 4282466.93 frames. ], batch size: 247, lr: 8.07e-03, grad_scale: 32.0 2023-06-20 19:31:51,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-20 19:32:19,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.14 vs. 
limit=22.5 2023-06-20 19:32:37,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=624984.0, ans=0.125 2023-06-20 19:32:49,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=625044.0, ans=0.0 2023-06-20 19:33:06,885 INFO [train.py:996] (0/4) Epoch 4, batch 12700, loss[loss=0.2813, simple_loss=0.3464, pruned_loss=0.1081, over 21637.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3221, pruned_loss=0.08747, over 4281494.76 frames. ], batch size: 389, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 19:33:11,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-20 19:33:19,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=625104.0, ans=0.1 2023-06-20 19:33:23,641 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:33:51,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.673e+02 3.135e+02 3.667e+02 6.631e+02, threshold=6.269e+02, percent-clipped=1.0 2023-06-20 19:34:05,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=625224.0, ans=0.1 2023-06-20 19:34:22,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=625344.0, ans=10.0 2023-06-20 19:34:23,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=625344.0, ans=0.0 2023-06-20 19:34:40,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=625344.0, ans=0.125 2023-06-20 19:34:42,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.45 vs. limit=15.0 2023-06-20 19:34:43,217 INFO [train.py:996] (0/4) Epoch 4, batch 12750, loss[loss=0.2996, simple_loss=0.3568, pruned_loss=0.1212, over 21675.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3251, pruned_loss=0.08887, over 4286640.84 frames. ], batch size: 508, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 19:35:05,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.31 vs. 
limit=12.0 2023-06-20 19:35:14,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=625464.0, ans=0.0 2023-06-20 19:35:47,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=625524.0, ans=0.2 2023-06-20 19:35:54,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=625524.0, ans=0.2 2023-06-20 19:35:59,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=625584.0, ans=0.1 2023-06-20 19:36:32,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=625644.0, ans=10.0 2023-06-20 19:36:34,897 INFO [train.py:996] (0/4) Epoch 4, batch 12800, loss[loss=0.2865, simple_loss=0.3476, pruned_loss=0.1127, over 21766.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3232, pruned_loss=0.08896, over 4287595.61 frames. ], batch size: 441, lr: 8.06e-03, grad_scale: 32.0 2023-06-20 19:36:59,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=625764.0, ans=0.2 2023-06-20 19:37:23,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.419e+02 2.676e+02 3.175e+02 5.760e+02, threshold=5.353e+02, percent-clipped=0.0 2023-06-20 19:37:37,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=625884.0, ans=0.125 2023-06-20 19:38:03,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=625944.0, ans=0.04949747468305833 2023-06-20 19:38:37,383 INFO [train.py:996] (0/4) Epoch 4, batch 12850, loss[loss=0.2426, simple_loss=0.3165, pruned_loss=0.08434, over 21477.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3251, pruned_loss=0.09015, over 4283000.42 frames. ], batch size: 131, lr: 8.06e-03, grad_scale: 32.0 2023-06-20 19:38:38,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-20 19:39:15,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=626064.0, ans=0.125 2023-06-20 19:39:23,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=626124.0, ans=0.0 2023-06-20 19:40:23,450 INFO [train.py:996] (0/4) Epoch 4, batch 12900, loss[loss=0.2011, simple_loss=0.2721, pruned_loss=0.06508, over 21188.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.321, pruned_loss=0.08616, over 4276127.18 frames. 
], batch size: 159, lr: 8.06e-03, grad_scale: 32.0 2023-06-20 19:40:29,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=626304.0, ans=0.0 2023-06-20 19:40:58,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=626424.0, ans=0.04949747468305833 2023-06-20 19:41:02,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.262e+02 2.706e+02 3.188e+02 4.993e+02, threshold=5.411e+02, percent-clipped=0.0 2023-06-20 19:41:24,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-20 19:41:45,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.77 vs. limit=15.0 2023-06-20 19:41:58,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-20 19:42:13,996 INFO [train.py:996] (0/4) Epoch 4, batch 12950, loss[loss=0.2282, simple_loss=0.2954, pruned_loss=0.08047, over 21365.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3197, pruned_loss=0.08417, over 4281414.69 frames. ], batch size: 194, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:42:14,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=626604.0, ans=0.125 2023-06-20 19:42:37,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=626664.0, ans=0.2 2023-06-20 19:43:49,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=626844.0, ans=0.125 2023-06-20 19:43:52,076 INFO [train.py:996] (0/4) Epoch 4, batch 13000, loss[loss=0.1788, simple_loss=0.2612, pruned_loss=0.04823, over 21819.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3209, pruned_loss=0.0838, over 4272746.59 frames. ], batch size: 282, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:44:05,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=626904.0, ans=0.1 2023-06-20 19:44:30,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 2.289e+02 2.689e+02 3.180e+02 4.201e+02, threshold=5.379e+02, percent-clipped=0.0 2023-06-20 19:45:08,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=627144.0, ans=0.1 2023-06-20 19:45:15,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.33 vs. limit=15.0 2023-06-20 19:45:29,789 INFO [train.py:996] (0/4) Epoch 4, batch 13050, loss[loss=0.2277, simple_loss=0.3006, pruned_loss=0.07741, over 21460.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3145, pruned_loss=0.08104, over 4270440.39 frames. 
], batch size: 194, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:45:56,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=627264.0, ans=0.0 2023-06-20 19:46:49,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=627384.0, ans=0.0 2023-06-20 19:47:31,044 INFO [train.py:996] (0/4) Epoch 4, batch 13100, loss[loss=0.2425, simple_loss=0.3281, pruned_loss=0.07844, over 21697.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3168, pruned_loss=0.08145, over 4276652.09 frames. ], batch size: 441, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:47:50,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=627564.0, ans=0.0 2023-06-20 19:47:54,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=627564.0, ans=0.0 2023-06-20 19:48:05,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=15.0 2023-06-20 19:48:16,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 2.575e+02 2.978e+02 3.531e+02 5.580e+02, threshold=5.955e+02, percent-clipped=1.0 2023-06-20 19:48:20,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-20 19:48:21,506 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:49:01,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=627744.0, ans=0.07 2023-06-20 19:49:09,612 INFO [train.py:996] (0/4) Epoch 4, batch 13150, loss[loss=0.2093, simple_loss=0.282, pruned_loss=0.06833, over 21658.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3194, pruned_loss=0.08501, over 4271212.56 frames. ], batch size: 247, lr: 8.05e-03, grad_scale: 32.0 2023-06-20 19:49:10,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.61 vs. limit=10.0 2023-06-20 19:49:43,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=627864.0, ans=0.125 2023-06-20 19:49:51,096 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:50:11,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=627984.0, ans=0.125 2023-06-20 19:50:42,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=628044.0, ans=0.125 2023-06-20 19:50:43,664 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.55 vs. limit=15.0 2023-06-20 19:51:05,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=628044.0, ans=0.125 2023-06-20 19:51:09,810 INFO [train.py:996] (0/4) Epoch 4, batch 13200, loss[loss=0.2867, simple_loss=0.3539, pruned_loss=0.1098, over 21436.00 frames. 
], tot_loss[loss=0.244, simple_loss=0.3176, pruned_loss=0.08522, over 4278626.56 frames. ], batch size: 131, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:51:17,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=628104.0, ans=0.0 2023-06-20 19:51:54,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.426e+02 2.747e+02 3.167e+02 4.367e+02, threshold=5.495e+02, percent-clipped=0.0 2023-06-20 19:52:48,945 INFO [train.py:996] (0/4) Epoch 4, batch 13250, loss[loss=0.2459, simple_loss=0.3061, pruned_loss=0.09283, over 21439.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3183, pruned_loss=0.08609, over 4272110.36 frames. ], batch size: 548, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:52:49,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=628404.0, ans=0.125 2023-06-20 19:52:51,589 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.85 vs. limit=10.0 2023-06-20 19:53:14,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-20 19:54:08,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=628584.0, ans=0.2 2023-06-20 19:54:38,769 INFO [train.py:996] (0/4) Epoch 4, batch 13300, loss[loss=0.267, simple_loss=0.3387, pruned_loss=0.09766, over 21777.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3202, pruned_loss=0.08632, over 4272865.13 frames. ], batch size: 118, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:54:39,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=628704.0, ans=0.1 2023-06-20 19:54:50,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=628704.0, ans=0.125 2023-06-20 19:55:22,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=628824.0, ans=0.1 2023-06-20 19:55:23,833 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.499e+02 2.765e+02 3.137e+02 6.068e+02, threshold=5.530e+02, percent-clipped=1.0 2023-06-20 19:55:32,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=628824.0, ans=0.0 2023-06-20 19:55:35,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=628884.0, ans=0.125 2023-06-20 19:55:46,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=628884.0, ans=0.125 2023-06-20 19:56:02,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=628944.0, ans=0.0 2023-06-20 19:56:17,427 INFO [train.py:996] (0/4) Epoch 4, batch 13350, loss[loss=0.268, simple_loss=0.3489, pruned_loss=0.09355, over 21708.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3247, pruned_loss=0.08907, over 4275001.01 frames. 
], batch size: 332, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:57:03,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=629124.0, ans=0.0 2023-06-20 19:57:03,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=629124.0, ans=0.07 2023-06-20 19:57:13,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=629184.0, ans=0.5 2023-06-20 19:57:54,438 INFO [train.py:996] (0/4) Epoch 4, batch 13400, loss[loss=0.245, simple_loss=0.3117, pruned_loss=0.08914, over 21573.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3256, pruned_loss=0.09021, over 4271088.73 frames. ], batch size: 131, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 19:58:27,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=629364.0, ans=0.2 2023-06-20 19:58:38,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.621e+02 2.989e+02 3.422e+02 4.870e+02, threshold=5.978e+02, percent-clipped=0.0 2023-06-20 19:59:03,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=15.0 2023-06-20 19:59:12,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=629484.0, ans=0.0 2023-06-20 19:59:37,649 INFO [train.py:996] (0/4) Epoch 4, batch 13450, loss[loss=0.2319, simple_loss=0.303, pruned_loss=0.08041, over 21726.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3279, pruned_loss=0.09311, over 4270359.41 frames. ], batch size: 351, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 20:00:50,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=629784.0, ans=0.125 2023-06-20 20:01:04,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=629844.0, ans=0.0 2023-06-20 20:01:08,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.01 vs. limit=15.0 2023-06-20 20:01:21,154 INFO [train.py:996] (0/4) Epoch 4, batch 13500, loss[loss=0.2262, simple_loss=0.289, pruned_loss=0.08172, over 21581.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3189, pruned_loss=0.09022, over 4273066.03 frames. ], batch size: 230, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 20:01:29,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=629904.0, ans=0.09899494936611666 2023-06-20 20:01:55,871 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.636e+02 2.928e+02 3.414e+02 6.823e+02, threshold=5.856e+02, percent-clipped=1.0 2023-06-20 20:03:00,264 INFO [train.py:996] (0/4) Epoch 4, batch 13550, loss[loss=0.2709, simple_loss=0.3696, pruned_loss=0.08609, over 21762.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.321, pruned_loss=0.08927, over 4268960.42 frames. 
], batch size: 282, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 20:03:02,392 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:03:17,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=630264.0, ans=0.2 2023-06-20 20:03:18,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.66 vs. limit=6.0 2023-06-20 20:03:25,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=630264.0, ans=0.1 2023-06-20 20:03:27,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-20 20:03:29,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=630324.0, ans=0.125 2023-06-20 20:03:49,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=630324.0, ans=0.125 2023-06-20 20:04:12,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=630384.0, ans=0.2 2023-06-20 20:04:35,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=630444.0, ans=0.05 2023-06-20 20:04:37,767 INFO [train.py:996] (0/4) Epoch 4, batch 13600, loss[loss=0.2365, simple_loss=0.3025, pruned_loss=0.08524, over 21676.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3226, pruned_loss=0.08948, over 4266783.66 frames. ], batch size: 230, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 20:04:41,876 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-20 20:04:53,277 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:05:17,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.746e+02 3.307e+02 4.061e+02 7.657e+02, threshold=6.614e+02, percent-clipped=3.0 2023-06-20 20:05:26,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=630624.0, ans=0.0 2023-06-20 20:06:12,027 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-20 20:06:14,064 INFO [train.py:996] (0/4) Epoch 4, batch 13650, loss[loss=0.2226, simple_loss=0.2761, pruned_loss=0.08457, over 21137.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3163, pruned_loss=0.08597, over 4270875.95 frames. 
], batch size: 143, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 20:07:10,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=630984.0, ans=0.2 2023-06-20 20:07:41,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=631044.0, ans=0.125 2023-06-20 20:07:44,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=631044.0, ans=0.0 2023-06-20 20:07:49,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=22.5 2023-06-20 20:07:52,652 INFO [train.py:996] (0/4) Epoch 4, batch 13700, loss[loss=0.235, simple_loss=0.2952, pruned_loss=0.08742, over 21623.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3122, pruned_loss=0.08512, over 4266206.89 frames. ], batch size: 247, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 20:07:57,024 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.43 vs. limit=5.0 2023-06-20 20:08:39,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.840e+02 3.302e+02 4.432e+02 7.267e+02, threshold=6.603e+02, percent-clipped=2.0 2023-06-20 20:08:48,291 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:09:25,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=22.5 2023-06-20 20:09:30,627 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:09:30,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=631404.0, ans=0.125 2023-06-20 20:09:31,697 INFO [train.py:996] (0/4) Epoch 4, batch 13750, loss[loss=0.2169, simple_loss=0.2888, pruned_loss=0.0725, over 21766.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3099, pruned_loss=0.08362, over 4268676.78 frames. ], batch size: 282, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:09:33,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=631404.0, ans=0.1 2023-06-20 20:11:17,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=631704.0, ans=0.0 2023-06-20 20:11:17,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=631704.0, ans=0.125 2023-06-20 20:11:18,440 INFO [train.py:996] (0/4) Epoch 4, batch 13800, loss[loss=0.2425, simple_loss=0.3386, pruned_loss=0.07326, over 21615.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3169, pruned_loss=0.08338, over 4266060.24 frames. 
], batch size: 263, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:11:33,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=631704.0, ans=0.0 2023-06-20 20:11:49,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=631764.0, ans=0.125 2023-06-20 20:12:12,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=631824.0, ans=0.125 2023-06-20 20:12:13,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=631824.0, ans=10.0 2023-06-20 20:12:15,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.546e+02 3.323e+02 4.159e+02 7.075e+02, threshold=6.647e+02, percent-clipped=2.0 2023-06-20 20:12:31,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=631884.0, ans=0.1 2023-06-20 20:12:34,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=631884.0, ans=0.125 2023-06-20 20:13:14,490 INFO [train.py:996] (0/4) Epoch 4, batch 13850, loss[loss=0.2187, simple_loss=0.2906, pruned_loss=0.07342, over 21875.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3217, pruned_loss=0.08408, over 4269293.60 frames. ], batch size: 107, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:13:47,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=632064.0, ans=0.0 2023-06-20 20:13:52,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=632064.0, ans=0.0 2023-06-20 20:14:07,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=632124.0, ans=0.1 2023-06-20 20:14:17,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=632184.0, ans=0.125 2023-06-20 20:14:24,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=632184.0, ans=0.125 2023-06-20 20:14:36,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=632244.0, ans=0.0 2023-06-20 20:14:53,321 INFO [train.py:996] (0/4) Epoch 4, batch 13900, loss[loss=0.2342, simple_loss=0.2959, pruned_loss=0.08626, over 21673.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3258, pruned_loss=0.08809, over 4270269.90 frames. ], batch size: 263, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:15:09,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. 
limit=15.0 2023-06-20 20:15:39,753 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.652e+02 3.240e+02 3.689e+02 5.944e+02, threshold=6.479e+02, percent-clipped=0.0 2023-06-20 20:15:45,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=632424.0, ans=0.125 2023-06-20 20:15:45,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=632424.0, ans=0.2 2023-06-20 20:15:48,698 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=6.0 2023-06-20 20:15:54,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0 2023-06-20 20:16:37,007 INFO [train.py:996] (0/4) Epoch 4, batch 13950, loss[loss=0.2428, simple_loss=0.315, pruned_loss=0.08533, over 21487.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3262, pruned_loss=0.08985, over 4274629.87 frames. ], batch size: 131, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 20:16:42,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=632604.0, ans=0.125 2023-06-20 20:16:53,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=632604.0, ans=0.2 2023-06-20 20:17:00,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=632664.0, ans=0.125 2023-06-20 20:18:13,424 INFO [train.py:996] (0/4) Epoch 4, batch 14000, loss[loss=0.2243, simple_loss=0.3211, pruned_loss=0.06381, over 21794.00 frames. ], tot_loss[loss=0.249, simple_loss=0.323, pruned_loss=0.08752, over 4275096.31 frames. ], batch size: 332, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:18:53,728 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.412e+02 2.959e+02 3.758e+02 5.707e+02, threshold=5.918e+02, percent-clipped=0.0 2023-06-20 20:18:57,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=633024.0, ans=0.125 2023-06-20 20:18:57,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-20 20:19:50,093 INFO [train.py:996] (0/4) Epoch 4, batch 14050, loss[loss=0.2043, simple_loss=0.2622, pruned_loss=0.07317, over 21418.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3171, pruned_loss=0.08417, over 4262904.88 frames. 
], batch size: 211, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:19:58,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=633204.0, ans=0.0 2023-06-20 20:20:07,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=633204.0, ans=0.125 2023-06-20 20:21:08,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=633444.0, ans=0.125 2023-06-20 20:21:19,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=633504.0, ans=0.0 2023-06-20 20:21:20,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-20 20:21:25,552 INFO [train.py:996] (0/4) Epoch 4, batch 14100, loss[loss=0.2207, simple_loss=0.2857, pruned_loss=0.0779, over 21540.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3106, pruned_loss=0.08319, over 4262968.89 frames. ], batch size: 263, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:22:05,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.333e+02 2.713e+02 3.192e+02 5.959e+02, threshold=5.427e+02, percent-clipped=1.0 2023-06-20 20:22:20,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=633684.0, ans=0.125 2023-06-20 20:22:22,386 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:22:55,923 INFO [train.py:996] (0/4) Epoch 4, batch 14150, loss[loss=0.2314, simple_loss=0.3172, pruned_loss=0.07277, over 21722.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3139, pruned_loss=0.0842, over 4263296.67 frames. ], batch size: 112, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:23:27,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=633864.0, ans=0.125 2023-06-20 20:23:30,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=633864.0, ans=0.125 2023-06-20 20:24:30,063 INFO [train.py:996] (0/4) Epoch 4, batch 14200, loss[loss=0.1984, simple_loss=0.284, pruned_loss=0.05638, over 21341.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3122, pruned_loss=0.08246, over 4265235.34 frames. 
], batch size: 176, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 20:24:36,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=634104.0, ans=0.025 2023-06-20 20:25:15,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.204e+02 2.467e+02 2.859e+02 4.192e+02, threshold=4.934e+02, percent-clipped=0.0 2023-06-20 20:25:49,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=634344.0, ans=0.125 2023-06-20 20:26:00,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=634344.0, ans=0.125 2023-06-20 20:26:03,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=634344.0, ans=0.125 2023-06-20 20:26:06,335 INFO [train.py:996] (0/4) Epoch 4, batch 14250, loss[loss=0.2398, simple_loss=0.3153, pruned_loss=0.0822, over 21532.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3069, pruned_loss=0.08251, over 4262078.64 frames. ], batch size: 441, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:27:07,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=634584.0, ans=0.125 2023-06-20 20:27:50,752 INFO [train.py:996] (0/4) Epoch 4, batch 14300, loss[loss=0.2196, simple_loss=0.2871, pruned_loss=0.07603, over 21399.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3078, pruned_loss=0.08169, over 4267696.85 frames. ], batch size: 131, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:27:59,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=634704.0, ans=0.1 2023-06-20 20:28:38,736 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.547e+02 2.874e+02 3.518e+02 5.819e+02, threshold=5.747e+02, percent-clipped=3.0 2023-06-20 20:29:01,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-06-20 20:29:14,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=634944.0, ans=0.0 2023-06-20 20:29:28,329 INFO [train.py:996] (0/4) Epoch 4, batch 14350, loss[loss=0.21, simple_loss=0.287, pruned_loss=0.06648, over 21470.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.314, pruned_loss=0.08268, over 4255579.41 frames. ], batch size: 131, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:30:19,015 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:30:19,033 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:30:38,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=635244.0, ans=0.04949747468305833 2023-06-20 20:31:05,228 INFO [train.py:996] (0/4) Epoch 4, batch 14400, loss[loss=0.2595, simple_loss=0.3117, pruned_loss=0.1036, over 21692.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3106, pruned_loss=0.08307, over 4255084.36 frames. 
], batch size: 414, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 20:31:11,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=635304.0, ans=0.125 2023-06-20 20:31:52,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=635424.0, ans=0.0 2023-06-20 20:31:53,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.383e+02 2.685e+02 3.261e+02 5.760e+02, threshold=5.369e+02, percent-clipped=1.0 2023-06-20 20:32:04,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=635484.0, ans=0.125 2023-06-20 20:32:14,528 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-20 20:32:41,697 INFO [train.py:996] (0/4) Epoch 4, batch 14450, loss[loss=0.2222, simple_loss=0.29, pruned_loss=0.07716, over 21790.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3074, pruned_loss=0.08427, over 4259185.56 frames. ], batch size: 124, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:32:48,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=635604.0, ans=0.0 2023-06-20 20:33:19,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=635724.0, ans=0.125 2023-06-20 20:34:00,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=635844.0, ans=0.125 2023-06-20 20:34:07,436 INFO [train.py:996] (0/4) Epoch 4, batch 14500, loss[loss=0.2161, simple_loss=0.3059, pruned_loss=0.06316, over 21744.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3039, pruned_loss=0.08316, over 4269778.20 frames. ], batch size: 282, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 20:34:46,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=635964.0, ans=0.125 2023-06-20 20:34:56,072 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.605e+02 3.187e+02 4.339e+02 7.236e+02, threshold=6.375e+02, percent-clipped=9.0 2023-06-20 20:34:56,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=636024.0, ans=0.125 2023-06-20 20:35:02,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=636084.0, ans=0.125 2023-06-20 20:35:15,788 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-20 20:35:41,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=636144.0, ans=0.125 2023-06-20 20:35:41,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=636144.0, ans=0.0 2023-06-20 20:35:45,554 INFO [train.py:996] (0/4) Epoch 4, batch 14550, loss[loss=0.2862, simple_loss=0.3479, pruned_loss=0.1123, over 21766.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3093, pruned_loss=0.08509, over 4276108.76 frames. 
], batch size: 332, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:36:11,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=636264.0, ans=0.0 2023-06-20 20:36:40,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=636324.0, ans=0.1 2023-06-20 20:37:30,046 INFO [train.py:996] (0/4) Epoch 4, batch 14600, loss[loss=0.2463, simple_loss=0.3319, pruned_loss=0.08031, over 21779.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3164, pruned_loss=0.08911, over 4272473.37 frames. ], batch size: 247, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:37:34,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=636504.0, ans=0.125 2023-06-20 20:37:50,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=636564.0, ans=0.2 2023-06-20 20:38:01,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=636564.0, ans=0.125 2023-06-20 20:38:12,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.825e+02 3.267e+02 4.052e+02 6.777e+02, threshold=6.533e+02, percent-clipped=1.0 2023-06-20 20:39:00,570 INFO [train.py:996] (0/4) Epoch 4, batch 14650, loss[loss=0.1708, simple_loss=0.2559, pruned_loss=0.04287, over 21592.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3189, pruned_loss=0.08813, over 4272847.19 frames. ], batch size: 230, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:39:22,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=636804.0, ans=0.125 2023-06-20 20:39:28,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=636864.0, ans=0.035 2023-06-20 20:39:37,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=636864.0, ans=0.0 2023-06-20 20:39:47,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=636924.0, ans=0.125 2023-06-20 20:40:05,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=636984.0, ans=0.0 2023-06-20 20:40:13,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=636984.0, ans=0.015 2023-06-20 20:40:13,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=636984.0, ans=0.2 2023-06-20 20:40:25,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-20 20:40:26,280 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:40:31,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.74 vs. limit=6.0 2023-06-20 20:40:43,438 INFO [train.py:996] (0/4) Epoch 4, batch 14700, loss[loss=0.2377, simple_loss=0.3193, pruned_loss=0.07806, over 21525.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3128, pruned_loss=0.08149, over 4272976.58 frames. 
], batch size: 508, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:40:44,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=637104.0, ans=0.5 2023-06-20 20:40:44,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=637104.0, ans=0.0 2023-06-20 20:40:58,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=637104.0, ans=0.125 2023-06-20 20:41:08,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-06-20 20:41:22,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0 2023-06-20 20:41:26,984 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.902e+02 2.312e+02 2.758e+02 4.493e+02, threshold=4.623e+02, percent-clipped=0.0 2023-06-20 20:41:45,029 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-20 20:41:46,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-20 20:41:56,211 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-20 20:42:11,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=637344.0, ans=0.04949747468305833 2023-06-20 20:42:27,588 INFO [train.py:996] (0/4) Epoch 4, batch 14750, loss[loss=0.2694, simple_loss=0.338, pruned_loss=0.1004, over 21492.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3194, pruned_loss=0.0851, over 4276596.88 frames. ], batch size: 194, lr: 7.99e-03, grad_scale: 16.0 2023-06-20 20:42:37,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=637404.0, ans=0.125 2023-06-20 20:42:42,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=637464.0, ans=0.125 2023-06-20 20:42:50,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=637464.0, ans=0.125 2023-06-20 20:43:59,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=637584.0, ans=0.125 2023-06-20 20:44:19,802 INFO [train.py:996] (0/4) Epoch 4, batch 14800, loss[loss=0.2373, simple_loss=0.3035, pruned_loss=0.08552, over 21822.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3313, pruned_loss=0.09118, over 4273968.35 frames. ], batch size: 107, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:44:24,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=637704.0, ans=0.125 2023-06-20 20:45:05,417 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 3.268e+02 3.887e+02 5.109e+02 8.215e+02, threshold=7.774e+02, percent-clipped=33.0 2023-06-20 20:45:08,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. 
limit=12.0 2023-06-20 20:45:31,656 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-20 20:45:43,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=637944.0, ans=0.125 2023-06-20 20:45:56,116 INFO [train.py:996] (0/4) Epoch 4, batch 14850, loss[loss=0.2124, simple_loss=0.2782, pruned_loss=0.07331, over 21859.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3239, pruned_loss=0.08992, over 4270675.90 frames. ], batch size: 107, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:47:05,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-20 20:47:15,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=638184.0, ans=0.125 2023-06-20 20:47:17,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=638184.0, ans=0.0 2023-06-20 20:47:18,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=638184.0, ans=0.0 2023-06-20 20:47:52,983 INFO [train.py:996] (0/4) Epoch 4, batch 14900, loss[loss=0.2711, simple_loss=0.3423, pruned_loss=0.09995, over 21386.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3282, pruned_loss=0.0925, over 4269574.09 frames. ], batch size: 549, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:48:05,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-20 20:48:07,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=638304.0, ans=0.1 2023-06-20 20:48:13,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=638364.0, ans=0.125 2023-06-20 20:48:40,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=12.0 2023-06-20 20:48:58,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=12.0 2023-06-20 20:49:02,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.913e+02 3.593e+02 4.421e+02 7.605e+02, threshold=7.185e+02, percent-clipped=0.0 2023-06-20 20:49:12,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=638484.0, ans=0.125 2023-06-20 20:49:26,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=638544.0, ans=0.125 2023-06-20 20:49:44,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=638544.0, ans=0.2 2023-06-20 20:49:51,620 INFO [train.py:996] (0/4) Epoch 4, batch 14950, loss[loss=0.2146, simple_loss=0.2986, pruned_loss=0.06529, over 21679.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3284, pruned_loss=0.09186, over 4273197.14 frames. 
], batch size: 298, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:50:01,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=638604.0, ans=0.95 2023-06-20 20:50:29,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=638664.0, ans=0.1 2023-06-20 20:50:34,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=638724.0, ans=0.2 2023-06-20 20:50:43,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=638724.0, ans=0.125 2023-06-20 20:51:29,186 INFO [train.py:996] (0/4) Epoch 4, batch 15000, loss[loss=0.2277, simple_loss=0.2968, pruned_loss=0.07931, over 21791.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.33, pruned_loss=0.09237, over 4274475.37 frames. ], batch size: 247, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 20:51:29,187 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 20:52:21,546 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2644, simple_loss=0.3595, pruned_loss=0.08463, over 1796401.00 frames. 2023-06-20 20:52:21,547 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-20 20:52:30,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=638904.0, ans=0.0 2023-06-20 20:52:56,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=639024.0, ans=0.125 2023-06-20 20:53:00,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.05 vs. limit=10.0 2023-06-20 20:53:05,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.416e+02 2.838e+02 3.359e+02 5.526e+02, threshold=5.676e+02, percent-clipped=0.0 2023-06-20 20:53:36,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=639084.0, ans=0.1 2023-06-20 20:54:04,785 INFO [train.py:996] (0/4) Epoch 4, batch 15050, loss[loss=0.2464, simple_loss=0.3337, pruned_loss=0.07949, over 21642.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3299, pruned_loss=0.0932, over 4270611.89 frames. ], batch size: 263, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 20:54:45,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=639324.0, ans=0.125 2023-06-20 20:55:14,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=639444.0, ans=0.125 2023-06-20 20:55:37,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=639444.0, ans=0.125 2023-06-20 20:55:41,719 INFO [train.py:996] (0/4) Epoch 4, batch 15100, loss[loss=0.2613, simple_loss=0.332, pruned_loss=0.09527, over 21342.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.331, pruned_loss=0.09224, over 4277816.43 frames. 
], batch size: 159, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 20:55:57,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=639564.0, ans=0.09899494936611666 2023-06-20 20:56:16,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=639624.0, ans=0.0 2023-06-20 20:56:31,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.752e+02 3.160e+02 3.731e+02 5.799e+02, threshold=6.320e+02, percent-clipped=1.0 2023-06-20 20:57:17,850 INFO [train.py:996] (0/4) Epoch 4, batch 15150, loss[loss=0.2201, simple_loss=0.2773, pruned_loss=0.08144, over 21309.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3279, pruned_loss=0.09299, over 4276898.48 frames. ], batch size: 211, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 20:57:21,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=639804.0, ans=0.2 2023-06-20 20:57:28,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=639804.0, ans=0.125 2023-06-20 20:57:55,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=639924.0, ans=0.125 2023-06-20 20:58:12,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=15.0 2023-06-20 20:58:39,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=640044.0, ans=0.05 2023-06-20 20:58:39,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=640044.0, ans=0.0 2023-06-20 20:58:46,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=640044.0, ans=0.0 2023-06-20 20:58:52,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=640104.0, ans=0.125 2023-06-20 20:58:53,766 INFO [train.py:996] (0/4) Epoch 4, batch 15200, loss[loss=0.2037, simple_loss=0.2627, pruned_loss=0.07231, over 21228.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3183, pruned_loss=0.08841, over 4258467.06 frames. ], batch size: 143, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 20:59:01,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=640104.0, ans=0.125 2023-06-20 20:59:43,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 2.484e+02 2.960e+02 3.412e+02 6.984e+02, threshold=5.920e+02, percent-clipped=2.0 2023-06-20 20:59:46,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=640224.0, ans=0.015 2023-06-20 21:00:29,825 INFO [train.py:996] (0/4) Epoch 4, batch 15250, loss[loss=0.2538, simple_loss=0.3147, pruned_loss=0.09639, over 21467.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.313, pruned_loss=0.08669, over 4269526.59 frames. 
], batch size: 389, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 21:00:30,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=640404.0, ans=0.125 2023-06-20 21:00:41,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-20 21:02:05,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=640644.0, ans=15.0 2023-06-20 21:02:07,005 INFO [train.py:996] (0/4) Epoch 4, batch 15300, loss[loss=0.2256, simple_loss=0.2895, pruned_loss=0.08092, over 20160.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3152, pruned_loss=0.08931, over 4259294.75 frames. ], batch size: 702, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 21:02:31,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=640764.0, ans=0.125 2023-06-20 21:02:33,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-20 21:02:41,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-20 21:02:50,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=640824.0, ans=0.125 2023-06-20 21:03:13,355 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.730e+02 3.158e+02 3.743e+02 8.189e+02, threshold=6.315e+02, percent-clipped=2.0 2023-06-20 21:03:30,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=640884.0, ans=0.125 2023-06-20 21:03:48,996 INFO [train.py:996] (0/4) Epoch 4, batch 15350, loss[loss=0.2866, simple_loss=0.3494, pruned_loss=0.1119, over 21797.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3205, pruned_loss=0.09211, over 4265797.47 frames. ], batch size: 441, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 21:04:00,660 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:04:29,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=641064.0, ans=0.125 2023-06-20 21:04:38,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=641064.0, ans=0.0 2023-06-20 21:05:05,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-06-20 21:05:12,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=641184.0, ans=0.125 2023-06-20 21:05:39,759 INFO [train.py:996] (0/4) Epoch 4, batch 15400, loss[loss=0.2288, simple_loss=0.3115, pruned_loss=0.073, over 21170.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3201, pruned_loss=0.0894, over 4271626.54 frames. ], batch size: 143, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 21:06:21,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.94 vs. 
limit=22.5 2023-06-20 21:06:23,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.400e+02 2.718e+02 3.167e+02 5.553e+02, threshold=5.437e+02, percent-clipped=0.0 2023-06-20 21:06:59,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=641544.0, ans=0.125 2023-06-20 21:07:10,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=641544.0, ans=0.2 2023-06-20 21:07:15,588 INFO [train.py:996] (0/4) Epoch 4, batch 15450, loss[loss=0.2365, simple_loss=0.3099, pruned_loss=0.08151, over 21899.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3185, pruned_loss=0.08827, over 4267393.00 frames. ], batch size: 107, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 21:08:43,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=641844.0, ans=0.2 2023-06-20 21:08:57,928 INFO [train.py:996] (0/4) Epoch 4, batch 15500, loss[loss=0.2269, simple_loss=0.2855, pruned_loss=0.08415, over 21284.00 frames. ], tot_loss[loss=0.25, simple_loss=0.322, pruned_loss=0.08903, over 4257122.16 frames. ], batch size: 608, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 21:09:10,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=641904.0, ans=0.125 2023-06-20 21:09:13,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=641964.0, ans=0.0 2023-06-20 21:09:15,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-20 21:09:31,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=641964.0, ans=0.0 2023-06-20 21:09:46,892 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.67 vs. limit=15.0 2023-06-20 21:09:48,558 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.680e+02 2.414e+02 2.758e+02 3.182e+02 6.018e+02, threshold=5.516e+02, percent-clipped=3.0 2023-06-20 21:09:53,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=642084.0, ans=0.1 2023-06-20 21:10:42,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=642144.0, ans=0.0 2023-06-20 21:10:49,953 INFO [train.py:996] (0/4) Epoch 4, batch 15550, loss[loss=0.1909, simple_loss=0.2814, pruned_loss=0.05018, over 21719.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3212, pruned_loss=0.08685, over 4257253.77 frames. ], batch size: 298, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 21:11:24,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=642264.0, ans=0.2 2023-06-20 21:11:24,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=642264.0, ans=0.0 2023-06-20 21:12:19,324 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. 
limit=15.0 2023-06-20 21:12:25,920 INFO [train.py:996] (0/4) Epoch 4, batch 15600, loss[loss=0.2388, simple_loss=0.2984, pruned_loss=0.0896, over 21168.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3164, pruned_loss=0.08485, over 4258715.41 frames. ], batch size: 548, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 21:12:27,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=642504.0, ans=0.125 2023-06-20 21:12:32,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=642504.0, ans=0.2 2023-06-20 21:13:32,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.378e+02 2.705e+02 3.069e+02 6.852e+02, threshold=5.411e+02, percent-clipped=1.0 2023-06-20 21:13:45,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=642684.0, ans=0.04949747468305833 2023-06-20 21:13:46,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=642684.0, ans=0.2 2023-06-20 21:13:49,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=642684.0, ans=0.125 2023-06-20 21:14:17,075 INFO [train.py:996] (0/4) Epoch 4, batch 15650, loss[loss=0.2554, simple_loss=0.3058, pruned_loss=0.1025, over 21527.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.315, pruned_loss=0.08422, over 4259707.71 frames. ], batch size: 414, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 21:14:18,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-20 21:14:22,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=642804.0, ans=0.0 2023-06-20 21:14:24,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=642804.0, ans=0.2 2023-06-20 21:15:09,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=642924.0, ans=0.125 2023-06-20 21:15:12,886 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-20 21:16:05,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=643044.0, ans=0.1 2023-06-20 21:16:18,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=643104.0, ans=0.1 2023-06-20 21:16:19,617 INFO [train.py:996] (0/4) Epoch 4, batch 15700, loss[loss=0.2578, simple_loss=0.3157, pruned_loss=0.09998, over 21271.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3108, pruned_loss=0.08387, over 4263817.17 frames. 
], batch size: 471, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 21:16:49,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=643164.0, ans=0.125 2023-06-20 21:17:00,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=643164.0, ans=0.0 2023-06-20 21:17:16,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.295e+02 2.525e+02 2.882e+02 4.287e+02, threshold=5.050e+02, percent-clipped=0.0 2023-06-20 21:17:24,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=12.0 2023-06-20 21:17:58,120 INFO [train.py:996] (0/4) Epoch 4, batch 15750, loss[loss=0.2244, simple_loss=0.2857, pruned_loss=0.0815, over 21789.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.305, pruned_loss=0.08298, over 4259222.23 frames. ], batch size: 118, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 21:18:06,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=643404.0, ans=0.125 2023-06-20 21:18:34,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=643464.0, ans=0.125 2023-06-20 21:18:51,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.26 vs. limit=10.0 2023-06-20 21:18:56,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=643524.0, ans=0.2 2023-06-20 21:19:12,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=643584.0, ans=0.0 2023-06-20 21:19:14,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.66 vs. limit=12.0 2023-06-20 21:19:33,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=643644.0, ans=0.0 2023-06-20 21:19:44,100 INFO [train.py:996] (0/4) Epoch 4, batch 15800, loss[loss=0.2284, simple_loss=0.2854, pruned_loss=0.08565, over 21684.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.2999, pruned_loss=0.08279, over 4265742.45 frames. ], batch size: 333, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 21:19:44,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=643704.0, ans=0.1 2023-06-20 21:20:34,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-20 21:20:39,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=22.5 2023-06-20 21:20:51,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.525e+02 2.910e+02 3.653e+02 6.469e+02, threshold=5.821e+02, percent-clipped=2.0 2023-06-20 21:21:15,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=643884.0, ans=0.125 2023-06-20 21:21:34,812 INFO [train.py:996] (0/4) Epoch 4, batch 15850, loss[loss=0.2306, simple_loss=0.2975, pruned_loss=0.08181, over 20058.00 frames. 
], tot_loss[loss=0.2372, simple_loss=0.3031, pruned_loss=0.08566, over 4270668.81 frames. ], batch size: 703, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 21:22:28,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-20 21:22:59,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=644244.0, ans=0.125 2023-06-20 21:23:02,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=644244.0, ans=0.125 2023-06-20 21:23:02,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=644244.0, ans=0.125 2023-06-20 21:23:12,749 INFO [train.py:996] (0/4) Epoch 4, batch 15900, loss[loss=0.2301, simple_loss=0.3022, pruned_loss=0.07902, over 21778.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.2998, pruned_loss=0.0846, over 4275768.72 frames. ], batch size: 124, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 21:23:16,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=644304.0, ans=0.0 2023-06-20 21:23:21,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=644304.0, ans=0.0 2023-06-20 21:23:33,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=644364.0, ans=0.2 2023-06-20 21:24:09,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=644424.0, ans=0.2 2023-06-20 21:24:09,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=644424.0, ans=0.05 2023-06-20 21:24:14,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.520e+02 3.020e+02 3.568e+02 5.069e+02, threshold=6.040e+02, percent-clipped=0.0 2023-06-20 21:24:39,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.64 vs. limit=22.5 2023-06-20 21:24:43,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=644544.0, ans=0.0 2023-06-20 21:24:54,175 INFO [train.py:996] (0/4) Epoch 4, batch 15950, loss[loss=0.2156, simple_loss=0.2899, pruned_loss=0.07064, over 20765.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3015, pruned_loss=0.08291, over 4271878.19 frames. ], batch size: 607, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 21:24:57,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=644604.0, ans=0.05 2023-06-20 21:25:21,274 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:25:32,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=644664.0, ans=0.0 2023-06-20 21:26:33,928 INFO [train.py:996] (0/4) Epoch 4, batch 16000, loss[loss=0.2042, simple_loss=0.2922, pruned_loss=0.0581, over 21693.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3037, pruned_loss=0.08103, over 4272109.14 frames. 
], batch size: 247, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 21:26:40,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=644904.0, ans=0.125 2023-06-20 21:26:44,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=644904.0, ans=0.125 2023-06-20 21:26:50,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=644964.0, ans=0.1 2023-06-20 21:27:26,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=645024.0, ans=0.125 2023-06-20 21:27:32,073 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.333e+02 2.770e+02 3.330e+02 6.195e+02, threshold=5.540e+02, percent-clipped=2.0 2023-06-20 21:28:11,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=645144.0, ans=0.0 2023-06-20 21:28:20,584 INFO [train.py:996] (0/4) Epoch 4, batch 16050, loss[loss=0.3195, simple_loss=0.4061, pruned_loss=0.1164, over 21540.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3061, pruned_loss=0.07846, over 4275311.61 frames. ], batch size: 471, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 21:29:44,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=645444.0, ans=0.125 2023-06-20 21:29:48,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=645444.0, ans=15.0 2023-06-20 21:29:59,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=645444.0, ans=0.0 2023-06-20 21:30:02,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-20 21:30:05,004 INFO [train.py:996] (0/4) Epoch 4, batch 16100, loss[loss=0.2365, simple_loss=0.3268, pruned_loss=0.07315, over 21400.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3095, pruned_loss=0.07976, over 4271836.04 frames. ], batch size: 211, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 21:31:00,660 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.499e+02 3.072e+02 4.062e+02 6.589e+02, threshold=6.145e+02, percent-clipped=6.0 2023-06-20 21:31:10,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=645684.0, ans=0.125 2023-06-20 21:31:19,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=645744.0, ans=0.0 2023-06-20 21:31:25,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=645744.0, ans=0.07 2023-06-20 21:31:37,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-06-20 21:31:41,515 INFO [train.py:996] (0/4) Epoch 4, batch 16150, loss[loss=0.25, simple_loss=0.3061, pruned_loss=0.0969, over 21538.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.311, pruned_loss=0.08301, over 4281731.24 frames. 
], batch size: 548, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 21:31:53,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=645804.0, ans=0.125 2023-06-20 21:32:45,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=645984.0, ans=0.125 2023-06-20 21:32:58,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=645984.0, ans=0.125 2023-06-20 21:33:17,671 INFO [train.py:996] (0/4) Epoch 4, batch 16200, loss[loss=0.2964, simple_loss=0.3634, pruned_loss=0.1147, over 21743.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3157, pruned_loss=0.08404, over 4282128.32 frames. ], batch size: 441, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 21:33:18,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=646104.0, ans=12.0 2023-06-20 21:33:27,022 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:33:58,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-20 21:34:10,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=646224.0, ans=10.0 2023-06-20 21:34:14,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.796e+02 3.136e+02 3.765e+02 7.945e+02, threshold=6.271e+02, percent-clipped=2.0 2023-06-20 21:34:51,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=646344.0, ans=0.0 2023-06-20 21:34:53,532 INFO [train.py:996] (0/4) Epoch 4, batch 16250, loss[loss=0.2034, simple_loss=0.2785, pruned_loss=0.06418, over 21640.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3169, pruned_loss=0.08408, over 4279917.94 frames. ], batch size: 247, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 21:35:00,686 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0 2023-06-20 21:35:02,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=646404.0, ans=0.2 2023-06-20 21:36:01,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=646584.0, ans=0.125 2023-06-20 21:36:30,012 INFO [train.py:996] (0/4) Epoch 4, batch 16300, loss[loss=0.2408, simple_loss=0.325, pruned_loss=0.07831, over 21553.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3103, pruned_loss=0.08062, over 4276741.78 frames. ], batch size: 441, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 21:36:34,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=646704.0, ans=0.125 2023-06-20 21:36:42,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.06 vs. 
limit=22.5 2023-06-20 21:37:03,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=646764.0, ans=0.1 2023-06-20 21:37:27,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=646824.0, ans=0.0 2023-06-20 21:37:28,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 2.180e+02 2.582e+02 2.940e+02 4.968e+02, threshold=5.164e+02, percent-clipped=0.0 2023-06-20 21:37:40,994 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-20 21:37:50,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=646884.0, ans=0.125 2023-06-20 21:37:54,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=646944.0, ans=0.125 2023-06-20 21:38:10,080 INFO [train.py:996] (0/4) Epoch 4, batch 16350, loss[loss=0.2899, simple_loss=0.3505, pruned_loss=0.1146, over 21538.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3083, pruned_loss=0.08086, over 4270857.01 frames. ], batch size: 414, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 21:38:24,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=12.0 2023-06-20 21:39:11,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=647124.0, ans=0.125 2023-06-20 21:39:42,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-20 21:39:44,430 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-20 21:39:54,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-20 21:39:55,228 INFO [train.py:996] (0/4) Epoch 4, batch 16400, loss[loss=0.2329, simple_loss=0.301, pruned_loss=0.08235, over 21917.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3132, pruned_loss=0.08359, over 4271820.61 frames. ], batch size: 118, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:40:05,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=647304.0, ans=0.125 2023-06-20 21:40:48,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=647424.0, ans=0.125 2023-06-20 21:40:52,417 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.622e+02 2.917e+02 3.537e+02 6.383e+02, threshold=5.834e+02, percent-clipped=4.0 2023-06-20 21:41:09,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=647484.0, ans=0.0 2023-06-20 21:41:19,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-20 21:41:32,875 INFO [train.py:996] (0/4) Epoch 4, batch 16450, loss[loss=0.2262, simple_loss=0.2909, pruned_loss=0.0807, over 21716.00 frames. 
], tot_loss[loss=0.2407, simple_loss=0.3127, pruned_loss=0.08436, over 4281285.53 frames. ], batch size: 230, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:41:52,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=647664.0, ans=0.125 2023-06-20 21:42:28,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=647724.0, ans=0.2 2023-06-20 21:42:58,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=647844.0, ans=0.125 2023-06-20 21:43:17,619 INFO [train.py:996] (0/4) Epoch 4, batch 16500, loss[loss=0.2997, simple_loss=0.3675, pruned_loss=0.1159, over 21514.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3129, pruned_loss=0.08513, over 4281693.82 frames. ], batch size: 508, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:43:53,322 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-108000.pt 2023-06-20 21:43:58,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=647964.0, ans=0.2 2023-06-20 21:44:24,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.642e+02 3.003e+02 3.781e+02 6.239e+02, threshold=6.006e+02, percent-clipped=1.0 2023-06-20 21:45:17,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=648144.0, ans=0.125 2023-06-20 21:45:30,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=648204.0, ans=0.0 2023-06-20 21:45:32,252 INFO [train.py:996] (0/4) Epoch 4, batch 16550, loss[loss=0.1159, simple_loss=0.1578, pruned_loss=0.03698, over 16768.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3088, pruned_loss=0.08105, over 4273494.61 frames. ], batch size: 60, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:45:38,877 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-06-20 21:45:41,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=648204.0, ans=0.025 2023-06-20 21:45:59,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=648264.0, ans=0.125 2023-06-20 21:46:14,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=648324.0, ans=0.1 2023-06-20 21:46:19,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-20 21:46:20,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=648324.0, ans=0.125 2023-06-20 21:46:31,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. 
limit=6.0 2023-06-20 21:47:12,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=648444.0, ans=0.125 2023-06-20 21:47:12,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=648444.0, ans=0.125 2023-06-20 21:47:29,698 INFO [train.py:996] (0/4) Epoch 4, batch 16600, loss[loss=0.2851, simple_loss=0.385, pruned_loss=0.09259, over 21798.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3183, pruned_loss=0.08482, over 4274864.14 frames. ], batch size: 282, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:47:31,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=648504.0, ans=0.2 2023-06-20 21:47:44,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=648504.0, ans=0.0 2023-06-20 21:48:03,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=648564.0, ans=0.125 2023-06-20 21:48:22,114 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.551e+02 2.918e+02 3.524e+02 5.305e+02, threshold=5.835e+02, percent-clipped=0.0 2023-06-20 21:48:29,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=648624.0, ans=0.0 2023-06-20 21:48:52,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=648744.0, ans=0.1 2023-06-20 21:49:13,947 INFO [train.py:996] (0/4) Epoch 4, batch 16650, loss[loss=0.2634, simple_loss=0.341, pruned_loss=0.09295, over 21953.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3274, pruned_loss=0.08649, over 4276485.84 frames. ], batch size: 372, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 21:49:17,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=648804.0, ans=0.125 2023-06-20 21:49:25,585 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:49:43,018 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:50:18,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=648924.0, ans=0.0 2023-06-20 21:50:35,242 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:50:35,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-20 21:50:42,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=649044.0, ans=0.125 2023-06-20 21:50:49,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.50 vs. limit=15.0 2023-06-20 21:50:53,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=649044.0, ans=0.125 2023-06-20 21:50:55,793 INFO [train.py:996] (0/4) Epoch 4, batch 16700, loss[loss=0.2549, simple_loss=0.3369, pruned_loss=0.08651, over 21665.00 frames. 
], tot_loss[loss=0.2544, simple_loss=0.3313, pruned_loss=0.08872, over 4275387.59 frames. ], batch size: 389, lr: 7.91e-03, grad_scale: 32.0
2023-06-20 21:51:48,309 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0
2023-06-20 21:52:00,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=649224.0, ans=0.125
2023-06-20 21:52:03,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.731e+02 3.045e+02 3.440e+02 5.036e+02, threshold=6.090e+02, percent-clipped=0.0
2023-06-20 21:52:03,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=649224.0, ans=0.1
2023-06-20 21:52:44,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=649344.0, ans=0.125
2023-06-20 21:52:55,499 INFO [train.py:996] (0/4) Epoch 4, batch 16750, loss[loss=0.2546, simple_loss=0.3226, pruned_loss=0.09328, over 19842.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3333, pruned_loss=0.09163, over 4274992.62 frames. ], batch size: 702, lr: 7.91e-03, grad_scale: 32.0
2023-06-20 21:53:04,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=649404.0, ans=0.125
2023-06-20 21:53:13,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=649404.0, ans=0.0
2023-06-20 21:53:18,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=649404.0, ans=0.1
2023-06-20 21:53:27,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0
2023-06-20 21:53:41,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.36 vs. limit=15.0
2023-06-20 21:54:32,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=649644.0, ans=0.1
2023-06-20 21:54:44,326 INFO [train.py:996] (0/4) Epoch 4, batch 16800, loss[loss=0.2355, simple_loss=0.3135, pruned_loss=0.07875, over 21816.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.337, pruned_loss=0.09132, over 4276772.72 frames. ], batch size: 332, lr: 7.91e-03, grad_scale: 32.0
2023-06-20 21:55:29,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=649824.0, ans=0.125
2023-06-20 21:55:44,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.879e+02 3.327e+02 4.049e+02 9.640e+02, threshold=6.654e+02, percent-clipped=7.0
2023-06-20 21:55:44,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=649824.0, ans=0.125
2023-06-20 21:56:20,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=649884.0, ans=0.0
2023-06-20 21:56:21,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=649884.0, ans=0.125
2023-06-20 21:56:54,185 INFO [train.py:996] (0/4) Epoch 4, batch 16850, loss[loss=0.2606, simple_loss=0.3257, pruned_loss=0.09772, over 21785.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3346, pruned_loss=0.09161, over 4273108.30 frames. ], batch size: 441, lr: 7.91e-03, grad_scale: 32.0
2023-06-20 21:57:16,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=650064.0, ans=0.2
2023-06-20 21:57:21,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=650064.0, ans=0.0
2023-06-20 21:57:31,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=650124.0, ans=0.125
2023-06-20 21:57:33,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=650124.0, ans=0.125
2023-06-20 21:57:54,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=650184.0, ans=0.125
2023-06-20 21:57:56,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=650184.0, ans=0.0
2023-06-20 21:58:30,572 INFO [train.py:996] (0/4) Epoch 4, batch 16900, loss[loss=0.2112, simple_loss=0.2879, pruned_loss=0.06722, over 20877.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3274, pruned_loss=0.09002, over 4276280.46 frames. ], batch size: 608, lr: 7.91e-03, grad_scale: 32.0
2023-06-20 21:59:16,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.755e+02 2.471e+02 2.836e+02 3.403e+02 5.773e+02, threshold=5.671e+02, percent-clipped=0.0
2023-06-20 21:59:26,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=650484.0, ans=0.2
2023-06-20 21:59:39,050 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 22:00:05,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=650544.0, ans=0.125
2023-06-20 22:00:07,489 INFO [train.py:996] (0/4) Epoch 4, batch 16950, loss[loss=0.2901, simple_loss=0.333, pruned_loss=0.1236, over 21764.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3199, pruned_loss=0.0878, over 4271987.77 frames. ], batch size: 508, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:00:11,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=12.0
2023-06-20 22:00:52,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=650724.0, ans=0.125
2023-06-20 22:01:54,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=650844.0, ans=0.125
2023-06-20 22:02:00,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.53 vs. limit=15.0
2023-06-20 22:02:01,824 INFO [train.py:996] (0/4) Epoch 4, batch 17000, loss[loss=0.2253, simple_loss=0.2814, pruned_loss=0.08458, over 21226.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.316, pruned_loss=0.08809, over 4273216.32 frames. ], batch size: 608, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:02:29,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=650964.0, ans=0.0
2023-06-20 22:02:30,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=650964.0, ans=0.125
2023-06-20 22:02:47,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.716e+02 3.273e+02 3.995e+02 5.622e+02, threshold=6.546e+02, percent-clipped=0.0
2023-06-20 22:02:52,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=22.5
2023-06-20 22:03:31,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=651144.0, ans=0.1
2023-06-20 22:03:37,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=651204.0, ans=0.0
2023-06-20 22:03:38,873 INFO [train.py:996] (0/4) Epoch 4, batch 17050, loss[loss=0.2511, simple_loss=0.3014, pruned_loss=0.1004, over 20253.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3243, pruned_loss=0.09178, over 4280107.80 frames. ], batch size: 707, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:04:15,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=651324.0, ans=0.0
2023-06-20 22:04:45,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=651384.0, ans=0.125
2023-06-20 22:05:13,753 INFO [train.py:996] (0/4) Epoch 4, batch 17100, loss[loss=0.2733, simple_loss=0.3386, pruned_loss=0.104, over 21789.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3236, pruned_loss=0.09279, over 4279458.96 frames. ], batch size: 112, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:05:18,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=651504.0, ans=0.125
2023-06-20 22:05:36,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=651564.0, ans=0.1
2023-06-20 22:05:43,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=651564.0, ans=0.125
2023-06-20 22:05:47,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=22.5
2023-06-20 22:05:59,554 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.611e+02 3.092e+02 3.570e+02 5.618e+02, threshold=6.184e+02, percent-clipped=0.0
2023-06-20 22:06:09,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=651684.0, ans=0.1
2023-06-20 22:06:11,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=651684.0, ans=0.125
2023-06-20 22:06:11,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=651684.0, ans=0.125
2023-06-20 22:06:51,005 INFO [train.py:996] (0/4) Epoch 4, batch 17150, loss[loss=0.2057, simple_loss=0.2765, pruned_loss=0.0674, over 21494.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3189, pruned_loss=0.09152, over 4278730.91 frames. ], batch size: 211, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:07:30,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=651864.0, ans=0.05
2023-06-20 22:07:30,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=651864.0, ans=0.125
2023-06-20 22:07:40,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5
2023-06-20 22:08:09,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0
2023-06-20 22:08:36,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=652044.0, ans=0.125
2023-06-20 22:08:48,887 INFO [train.py:996] (0/4) Epoch 4, batch 17200, loss[loss=0.3123, simple_loss=0.3617, pruned_loss=0.1315, over 21289.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3183, pruned_loss=0.09085, over 4273569.72 frames. ], batch size: 507, lr: 7.90e-03, grad_scale: 32.0
2023-06-20 22:09:29,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=652224.0, ans=0.125
2023-06-20 22:09:48,592 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.497e+02 2.913e+02 3.458e+02 6.747e+02, threshold=5.827e+02, percent-clipped=3.0
2023-06-20 22:10:29,123 INFO [train.py:996] (0/4) Epoch 4, batch 17250, loss[loss=0.2371, simple_loss=0.3183, pruned_loss=0.07797, over 21732.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3229, pruned_loss=0.09321, over 4273086.67 frames. ], batch size: 298, lr: 7.89e-03, grad_scale: 32.0
2023-06-20 22:10:34,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=652404.0, ans=0.125
2023-06-20 22:10:40,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=652404.0, ans=0.125
2023-06-20 22:11:06,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=652464.0, ans=0.0
2023-06-20 22:11:53,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=652644.0, ans=0.125
2023-06-20 22:11:56,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=652644.0, ans=0.0
2023-06-20 22:12:08,322 INFO [train.py:996] (0/4) Epoch 4, batch 17300, loss[loss=0.268, simple_loss=0.3387, pruned_loss=0.09866, over 21667.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3308, pruned_loss=0.09629, over 4274256.76 frames. ], batch size: 351, lr: 7.89e-03, grad_scale: 32.0
2023-06-20 22:12:45,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=652764.0, ans=0.0
2023-06-20 22:12:45,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.69 vs. limit=22.5
2023-06-20 22:12:46,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=652764.0, ans=0.2
2023-06-20 22:12:48,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=652764.0, ans=0.125
2023-06-20 22:12:57,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=652824.0, ans=0.5
2023-06-20 22:12:59,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=652824.0, ans=0.2
2023-06-20 22:13:13,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.676e+02 3.122e+02 3.587e+02 4.716e+02, threshold=6.244e+02, percent-clipped=0.0
2023-06-20 22:13:52,571 INFO [train.py:996] (0/4) Epoch 4, batch 17350, loss[loss=0.2232, simple_loss=0.3134, pruned_loss=0.06647, over 21750.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3311, pruned_loss=0.09562, over 4280026.87 frames. ], batch size: 332, lr: 7.89e-03, grad_scale: 32.0
2023-06-20 22:14:33,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=653064.0, ans=0.125
2023-06-20 22:14:36,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=653064.0, ans=0.1
2023-06-20 22:14:58,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=653124.0, ans=0.0
2023-06-20 22:15:43,271 INFO [train.py:996] (0/4) Epoch 4, batch 17400, loss[loss=0.3007, simple_loss=0.3729, pruned_loss=0.1142, over 21462.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.328, pruned_loss=0.09174, over 4275302.04 frames. ], batch size: 471, lr: 7.89e-03, grad_scale: 32.0
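The [optim.py:471] lines above summarize the distribution of recent gradient norms: the five numbers appear to be the min, 25%, median, 75%, and max of a window of recent norms, and in every entry here the clip threshold equals Clipping_scale times the median (e.g. 2.0 x 3.045e+02 = 6.090e+02). Below is a minimal sketch of how such a diagnostic line can be produced; it is an illustration of the logged quantities, not icefall's actual ScaledAdam code, and the window size of 200 is an assumption.

```python
# Illustrative sketch only -- not icefall's optim.py. Reproduces the shape of
# the "Clipping_scale=2.0, grad-norm quartiles ... threshold=...,
# percent-clipped=..." lines: quartiles over a window of recent grad norms, a
# threshold of clipping_scale * median, and the fraction of batches whose norm
# exceeded that threshold. The window size (200) is an assumption.
import torch

class GradNormDiagnostic:
    def __init__(self, window=200, clipping_scale=2.0):
        self.window = window
        self.clipping_scale = clipping_scale
        self.norms = []      # recent total grad norms
        self.clipped = 0     # batches whose norm exceeded the threshold
        self.seen = 0        # batches observed

    def update(self, grad_norm):
        """Record one batch's total grad norm; return the current threshold."""
        self.norms = (self.norms + [float(grad_norm)])[-self.window:]
        median = sorted(self.norms)[len(self.norms) // 2]
        threshold = self.clipping_scale * median
        self.seen += 1
        self.clipped += int(grad_norm > threshold)
        return threshold  # a caller would scale gradients down above this

    def report(self):
        q = torch.quantile(
            torch.tensor(self.norms),
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]),
        ).tolist()
        pct = 100.0 * self.clipped / max(self.seen, 1)
        return "Clipping_scale=%.1f, grad-norm quartiles %s, threshold=%.3e, percent-clipped=%.1f" % (
            self.clipping_scale,
            " ".join("%.3e" % v for v in q),
            self.clipping_scale * q[2],
            pct,
        )
```

A spike in the max quartile together with a nonzero percent-clipped (as in the 9.640e+02 / percent-clipped=7.0 entry above) is exactly the situation this kind of relative threshold is meant to absorb.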
2023-06-20 22:15:49,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=653304.0, ans=0.0
2023-06-20 22:16:09,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=653304.0, ans=0.125
2023-06-20 22:16:35,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=653364.0, ans=0.125
2023-06-20 22:16:49,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.607e+02 3.173e+02 3.817e+02 7.670e+02, threshold=6.346e+02, percent-clipped=2.0
2023-06-20 22:17:40,495 INFO [train.py:996] (0/4) Epoch 4, batch 17450, loss[loss=0.2381, simple_loss=0.2911, pruned_loss=0.09255, over 20124.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3221, pruned_loss=0.08851, over 4266488.54 frames. ], batch size: 707, lr: 7.89e-03, grad_scale: 32.0
2023-06-20 22:17:52,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=653604.0, ans=0.125
2023-06-20 22:19:11,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0
2023-06-20 22:19:11,783 INFO [train.py:996] (0/4) Epoch 4, batch 17500, loss[loss=0.196, simple_loss=0.2933, pruned_loss=0.04933, over 19856.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.317, pruned_loss=0.08515, over 4270803.88 frames. ], batch size: 703, lr: 7.89e-03, grad_scale: 32.0
2023-06-20 22:19:56,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=654024.0, ans=0.125
2023-06-20 22:20:00,467 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.227e+02 2.545e+02 2.907e+02 4.795e+02, threshold=5.089e+02, percent-clipped=0.0
2023-06-20 22:20:26,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=654144.0, ans=0.0
2023-06-20 22:20:29,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=654144.0, ans=0.2
2023-06-20 22:20:37,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. limit=10.0
2023-06-20 22:20:38,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=654144.0, ans=0.125
2023-06-20 22:20:46,770 INFO [train.py:996] (0/4) Epoch 4, batch 17550, loss[loss=0.2226, simple_loss=0.3003, pruned_loss=0.07247, over 16147.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3163, pruned_loss=0.08342, over 4261625.44 frames. ], batch size: 62, lr: 7.88e-03, grad_scale: 16.0
2023-06-20 22:21:25,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.05 vs. limit=10.0
2023-06-20 22:21:29,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=654324.0, ans=0.125
2023-06-20 22:22:02,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=654444.0, ans=0.0
2023-06-20 22:22:22,693 INFO [train.py:996] (0/4) Epoch 4, batch 17600, loss[loss=0.2598, simple_loss=0.3321, pruned_loss=0.09373, over 21684.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3191, pruned_loss=0.08424, over 4271023.85 frames. ], batch size: 351, lr: 7.88e-03, grad_scale: 32.0
2023-06-20 22:22:51,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=654564.0, ans=0.0
2023-06-20 22:23:12,014 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.529e+02 3.022e+02 3.829e+02 6.673e+02, threshold=6.045e+02, percent-clipped=10.0
2023-06-20 22:23:59,705 INFO [train.py:996] (0/4) Epoch 4, batch 17650, loss[loss=0.2445, simple_loss=0.3184, pruned_loss=0.08529, over 21568.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3195, pruned_loss=0.08461, over 4260839.40 frames. ], batch size: 441, lr: 7.88e-03, grad_scale: 32.0
2023-06-20 22:24:21,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=654864.0, ans=0.2
2023-06-20 22:25:36,826 INFO [train.py:996] (0/4) Epoch 4, batch 17700, loss[loss=0.2341, simple_loss=0.3054, pruned_loss=0.08139, over 21354.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3121, pruned_loss=0.08173, over 4255470.30 frames. ], batch size: 159, lr: 7.88e-03, grad_scale: 32.0
2023-06-20 22:25:44,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=655104.0, ans=0.125
2023-06-20 22:25:59,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=655164.0, ans=0.2
2023-06-20 22:26:01,913 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0
2023-06-20 22:26:36,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=655224.0, ans=0.125
2023-06-20 22:26:42,321 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.561e+02 3.035e+02 3.633e+02 7.633e+02, threshold=6.070e+02, percent-clipped=4.0
2023-06-20 22:26:56,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0
2023-06-20 22:27:07,270 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 22:27:11,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=655344.0, ans=0.125
2023-06-20 22:27:15,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0
2023-06-20 22:27:25,850 INFO [train.py:996] (0/4) Epoch 4, batch 17750, loss[loss=0.2261, simple_loss=0.3079, pruned_loss=0.07217, over 19971.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3202, pruned_loss=0.08582, over 4256945.14 frames. ], batch size: 703, lr: 7.88e-03, grad_scale: 32.0
2023-06-20 22:27:43,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=655404.0, ans=0.0
2023-06-20 22:28:28,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.03 vs. limit=12.0
2023-06-20 22:28:32,239 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0
2023-06-20 22:28:46,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=655584.0, ans=0.125
2023-06-20 22:28:46,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=655584.0, ans=0.04949747468305833
2023-06-20 22:28:53,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=655584.0, ans=0.125
2023-06-20 22:29:23,621 INFO [train.py:996] (0/4) Epoch 4, batch 17800, loss[loss=0.2269, simple_loss=0.3081, pruned_loss=0.07286, over 21916.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3209, pruned_loss=0.08596, over 4257887.58 frames. ], batch size: 317, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:29:24,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=655704.0, ans=0.125
2023-06-20 22:30:30,537 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.510e+02 2.835e+02 3.285e+02 4.580e+02, threshold=5.670e+02, percent-clipped=0.0
2023-06-20 22:30:34,459 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0
2023-06-20 22:30:43,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0
2023-06-20 22:31:14,507 INFO [train.py:996] (0/4) Epoch 4, batch 17850, loss[loss=0.2814, simple_loss=0.3408, pruned_loss=0.111, over 21464.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3201, pruned_loss=0.08584, over 4261024.13 frames. ], batch size: 211, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:32:00,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=12.0
2023-06-20 22:32:37,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=656184.0, ans=0.125
2023-06-20 22:32:48,619 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.60 vs. limit=12.0
2023-06-20 22:33:11,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=656244.0, ans=0.125
2023-06-20 22:33:25,136 INFO [train.py:996] (0/4) Epoch 4, batch 17900, loss[loss=0.273, simple_loss=0.3627, pruned_loss=0.09166, over 21598.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3249, pruned_loss=0.08773, over 4262155.47 frames. ], batch size: 414, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:34:14,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=656424.0, ans=0.1
2023-06-20 22:34:19,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.504e+02 2.807e+02 3.225e+02 4.472e+02, threshold=5.613e+02, percent-clipped=0.0
2023-06-20 22:34:23,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=656484.0, ans=0.125
2023-06-20 22:35:20,759 INFO [train.py:996] (0/4) Epoch 4, batch 17950, loss[loss=0.1872, simple_loss=0.2775, pruned_loss=0.04846, over 21517.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3231, pruned_loss=0.0841, over 4259473.04 frames. ], batch size: 230, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:35:23,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5
2023-06-20 22:35:37,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=656664.0, ans=0.125
2023-06-20 22:35:56,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=656724.0, ans=0.2
2023-06-20 22:36:03,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=656724.0, ans=0.0
2023-06-20 22:36:06,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=656724.0, ans=0.2
2023-06-20 22:36:06,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=656724.0, ans=0.2
2023-06-20 22:36:29,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=656784.0, ans=0.125
2023-06-20 22:36:44,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=656844.0, ans=0.125
2023-06-20 22:36:58,682 INFO [train.py:996] (0/4) Epoch 4, batch 18000, loss[loss=0.2369, simple_loss=0.2952, pruned_loss=0.08932, over 21553.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3157, pruned_loss=0.08233, over 4266885.93 frames. ], batch size: 414, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:36:58,693 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-20 22:37:41,573 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.7030, 5.1209, 5.4624, 4.8691], device='cuda:0')
2023-06-20 22:37:52,311 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.8129, 3.0289, 2.9848, 3.6525, 2.0859, 3.3967, 3.4971, 2.3250], device='cuda:0')
2023-06-20 22:37:57,970 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2692, simple_loss=0.3694, pruned_loss=0.08448, over 1796401.00 frames.
2023-06-20 22:37:57,970 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB
2023-06-20 22:38:23,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=656964.0, ans=0.0
2023-06-20 22:38:30,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=656964.0, ans=0.125
2023-06-20 22:38:34,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=657024.0, ans=0.0
2023-06-20 22:38:53,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 2.214e+02 2.674e+02 3.135e+02 4.981e+02, threshold=5.348e+02, percent-clipped=0.0
2023-06-20 22:39:02,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=657084.0, ans=0.125
2023-06-20 22:39:23,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=657144.0, ans=0.125
2023-06-20 22:39:32,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=657144.0, ans=0.1
2023-06-20 22:39:36,678 INFO [train.py:996] (0/4) Epoch 4, batch 18050, loss[loss=0.2329, simple_loss=0.3027, pruned_loss=0.08154, over 21694.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.31, pruned_loss=0.08125, over 4271654.65 frames. ], batch size: 298, lr: 7.87e-03, grad_scale: 32.0
2023-06-20 22:40:12,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=657324.0, ans=0.125
2023-06-20 22:40:18,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.66 vs. limit=22.5
2023-06-20 22:41:09,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=657444.0, ans=0.0
2023-06-20 22:41:10,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=657444.0, ans=0.125
2023-06-20 22:41:10,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=657444.0, ans=0.125
2023-06-20 22:41:11,194 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0
2023-06-20 22:41:20,647 INFO [train.py:996] (0/4) Epoch 4, batch 18100, loss[loss=0.2633, simple_loss=0.3254, pruned_loss=0.1006, over 20695.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.315, pruned_loss=0.08424, over 4266835.29 frames. ], batch size: 607, lr: 7.86e-03, grad_scale: 32.0
2023-06-20 22:41:30,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=657504.0, ans=0.125
2023-06-20 22:41:33,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=657504.0, ans=0.125
2023-06-20 22:42:21,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.529e+02 2.874e+02 3.352e+02 6.003e+02, threshold=5.748e+02, percent-clipped=2.0
2023-06-20 22:42:27,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=657684.0, ans=0.125
2023-06-20 22:42:44,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=657744.0, ans=0.0
2023-06-20 22:42:55,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=657744.0, ans=0.0
2023-06-20 22:42:56,443 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0
2023-06-20 22:42:57,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=657804.0, ans=0.125
2023-06-20 22:42:58,558 INFO [train.py:996] (0/4) Epoch 4, batch 18150, loss[loss=0.2136, simple_loss=0.2784, pruned_loss=0.07445, over 21367.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3167, pruned_loss=0.08373, over 4270181.69 frames. ], batch size: 131, lr: 7.86e-03, grad_scale: 32.0
2023-06-20 22:43:15,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0
2023-06-20 22:43:18,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=657864.0, ans=0.09899494936611666
2023-06-20 22:43:55,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=657924.0, ans=0.2
2023-06-20 22:43:58,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657984.0, ans=0.1
2023-06-20 22:44:36,738 INFO [train.py:996] (0/4) Epoch 4, batch 18200, loss[loss=0.1997, simple_loss=0.2704, pruned_loss=0.06451, over 21769.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3118, pruned_loss=0.08444, over 4267573.49 frames. ], batch size: 118, lr: 7.86e-03, grad_scale: 32.0
2023-06-20 22:44:40,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=658104.0, ans=0.125
2023-06-20 22:44:48,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=658104.0, ans=0.125
2023-06-20 22:45:00,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=658164.0, ans=0.1
2023-06-20 22:45:28,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0
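The frequent [scaling.py:182] ScheduledFloat entries record regularizer parameters (dropout rates, balancer probabilities, skip rates) whose value ("ans") is a function of batch_count. As a rough illustration of how such a schedule can be evaluated, here is a hypothetical piecewise-linear version; the breakpoints in the example are invented, not taken from the recipe, and icefall's actual ScheduledFloat in scaling.py is the authoritative implementation.

```python
# Hypothetical sketch of a piecewise-linear schedule, in the spirit of the
# "ScheduledFloat: name=..., batch_count=..., ans=..." lines above. The
# breakpoint values below are invented for illustration only.
def scheduled_float(batch_count, points):
    """points: list of (batch_count, value) pairs, sorted by batch_count."""
    if batch_count <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if batch_count <= x1:
            # linear interpolation between neighbouring breakpoints
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
    return points[-1][1]  # past the last breakpoint, hold the final value

# e.g. a skip rate that decays from 0.2 to 0.0 over the first 4000 batches
# and stays at 0.0 afterwards; by batch_count=649884.0 it is long since 0.0:
print(scheduled_float(649884.0, [(0.0, 0.2), (4000.0, 0.0)]))  # -> 0.0
```

Seen through that lens, the many "ans=0.0" skip-rate entries this deep into training are simply schedules that have already decayed to their final values.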
2023-06-20 22:45:32,362 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 2.336e+02 2.746e+02 3.220e+02 5.960e+02, threshold=5.492e+02, percent-clipped=1.0
2023-06-20 22:45:35,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=658284.0, ans=0.2
2023-06-20 22:45:49,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=658284.0, ans=15.0
2023-06-20 22:45:50,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=658284.0, ans=0.2
2023-06-20 22:45:54,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=658344.0, ans=0.95
2023-06-20 22:46:05,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=658344.0, ans=0.2
2023-06-20 22:46:08,034 INFO [train.py:996] (0/4) Epoch 4, batch 18250, loss[loss=0.1671, simple_loss=0.2402, pruned_loss=0.04695, over 16950.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3038, pruned_loss=0.08111, over 4260543.92 frames. ], batch size: 64, lr: 7.86e-03, grad_scale: 16.0
2023-06-20 22:46:57,720 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 22:47:29,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=658644.0, ans=0.125
2023-06-20 22:47:45,376 INFO [train.py:996] (0/4) Epoch 4, batch 18300, loss[loss=0.2985, simple_loss=0.3912, pruned_loss=0.1029, over 21398.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3037, pruned_loss=0.08166, over 4272167.21 frames. ], batch size: 548, lr: 7.86e-03, grad_scale: 16.0
2023-06-20 22:47:49,594 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=22.5
2023-06-20 22:48:28,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=658824.0, ans=0.1
2023-06-20 22:48:36,817 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.526e+02 2.841e+02 3.430e+02 6.700e+02, threshold=5.681e+02, percent-clipped=2.0
2023-06-20 22:48:56,211 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0
2023-06-20 22:49:11,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=658944.0, ans=0.125
2023-06-20 22:49:15,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=658944.0, ans=22.5
2023-06-20 22:49:22,793 INFO [train.py:996] (0/4) Epoch 4, batch 18350, loss[loss=0.2379, simple_loss=0.3105, pruned_loss=0.08265, over 21219.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3099, pruned_loss=0.08135, over 4268658.29 frames. ], batch size: 176, lr: 7.85e-03, grad_scale: 16.0
2023-06-20 22:49:30,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=659004.0, ans=0.0
2023-06-20 22:50:22,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=659184.0, ans=0.125
2023-06-20 22:50:33,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=659184.0, ans=10.0
2023-06-20 22:50:33,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0
2023-06-20 22:50:53,554 INFO [train.py:996] (0/4) Epoch 4, batch 18400, loss[loss=0.1686, simple_loss=0.2517, pruned_loss=0.04281, over 21481.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3051, pruned_loss=0.07998, over 4273934.44 frames. ], batch size: 212, lr: 7.85e-03, grad_scale: 32.0
2023-06-20 22:51:14,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=659364.0, ans=0.125
2023-06-20 22:51:21,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=659364.0, ans=0.125
2023-06-20 22:51:44,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=659424.0, ans=0.0
2023-06-20 22:52:08,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.306e+02 2.639e+02 3.073e+02 4.656e+02, threshold=5.278e+02, percent-clipped=0.0
2023-06-20 22:52:43,447 INFO [train.py:996] (0/4) Epoch 4, batch 18450, loss[loss=0.2137, simple_loss=0.2867, pruned_loss=0.07033, over 21882.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3009, pruned_loss=0.07604, over 4261907.74 frames. ], batch size: 373, lr: 7.85e-03, grad_scale: 16.0
2023-06-20 22:52:49,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=659604.0, ans=0.125
2023-06-20 22:53:28,955 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 22:54:07,988 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 22:54:19,822 INFO [train.py:996] (0/4) Epoch 4, batch 18500, loss[loss=0.2065, simple_loss=0.264, pruned_loss=0.07447, over 21188.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2961, pruned_loss=0.07457, over 4249563.63 frames. ], batch size: 548, lr: 7.85e-03, grad_scale: 16.0
2023-06-20 22:54:48,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=659964.0, ans=0.07
2023-06-20 22:55:22,933 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 2.377e+02 2.788e+02 3.386e+02 5.871e+02, threshold=5.576e+02, percent-clipped=1.0
2023-06-20 22:55:56,911 INFO [train.py:996] (0/4) Epoch 4, batch 18550, loss[loss=0.217, simple_loss=0.2812, pruned_loss=0.07641, over 21805.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2949, pruned_loss=0.07418, over 4238572.46 frames. ], batch size: 118, lr: 7.85e-03, grad_scale: 16.0
2023-06-20 22:56:01,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=660204.0, ans=0.0
2023-06-20 22:56:11,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=660204.0, ans=0.125
2023-06-20 22:57:45,992 INFO [train.py:996] (0/4) Epoch 4, batch 18600, loss[loss=0.1684, simple_loss=0.2421, pruned_loss=0.04735, over 21396.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2936, pruned_loss=0.07521, over 4232371.19 frames. ], batch size: 131, lr: 7.85e-03, grad_scale: 16.0
2023-06-20 22:57:48,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0
2023-06-20 22:57:59,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=660564.0, ans=0.125
2023-06-20 22:58:11,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=660564.0, ans=0.125
2023-06-20 22:58:38,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=660624.0, ans=22.5
2023-06-20 22:58:42,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.410e+02 2.851e+02 3.256e+02 5.084e+02, threshold=5.701e+02, percent-clipped=0.0
2023-06-20 22:58:57,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=660684.0, ans=0.125
2023-06-20 22:58:59,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.22 vs. limit=10.0
2023-06-20 22:59:01,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=660744.0, ans=0.2
2023-06-20 22:59:09,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=660744.0, ans=0.125
2023-06-20 22:59:15,910 INFO [train.py:996] (0/4) Epoch 4, batch 18650, loss[loss=0.2315, simple_loss=0.3234, pruned_loss=0.06975, over 20850.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2936, pruned_loss=0.07617, over 4237946.02 frames. ], batch size: 609, lr: 7.84e-03, grad_scale: 16.0
2023-06-20 22:59:41,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=660864.0, ans=0.1
2023-06-20 23:00:15,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=660924.0, ans=0.1
2023-06-20 23:00:44,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=660984.0, ans=0.09899494936611666
2023-06-20 23:00:47,034 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.78 vs. limit=6.0
2023-06-20 23:00:54,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=661044.0, ans=0.125
2023-06-20 23:00:56,834 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0
2023-06-20 23:00:59,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=661044.0, ans=0.125
2023-06-20 23:01:01,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0
2023-06-20 23:01:02,979 INFO [train.py:996] (0/4) Epoch 4, batch 18700, loss[loss=0.2029, simple_loss=0.2638, pruned_loss=0.07099, over 21580.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2915, pruned_loss=0.07777, over 4244811.71 frames. ], batch size: 263, lr: 7.84e-03, grad_scale: 16.0
2023-06-20 23:01:38,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=661164.0, ans=0.125
2023-06-20 23:01:51,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=661224.0, ans=0.125
2023-06-20 23:02:25,373 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.377e+02 2.645e+02 3.014e+02 5.356e+02, threshold=5.290e+02, percent-clipped=0.0
2023-06-20 23:02:34,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=661284.0, ans=0.125
2023-06-20 23:03:03,477 INFO [train.py:996] (0/4) Epoch 4, batch 18750, loss[loss=0.2451, simple_loss=0.327, pruned_loss=0.08158, over 21598.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2946, pruned_loss=0.08051, over 4245579.38 frames. ], batch size: 230, lr: 7.84e-03, grad_scale: 16.0
2023-06-20 23:03:28,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=661464.0, ans=0.0
2023-06-20 23:03:54,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=661524.0, ans=0.07
2023-06-20 23:04:09,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=661584.0, ans=0.07
2023-06-20 23:04:23,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=661644.0, ans=0.125
2023-06-20 23:04:27,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=661644.0, ans=0.125
2023-06-20 23:04:39,204 INFO [train.py:996] (0/4) Epoch 4, batch 18800, loss[loss=0.2353, simple_loss=0.3184, pruned_loss=0.07605, over 21685.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3008, pruned_loss=0.08166, over 4248607.75 frames. ], batch size: 441, lr: 7.84e-03, grad_scale: 32.0
2023-06-20 23:04:39,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=661704.0, ans=0.125
2023-06-20 23:04:44,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=18.91 vs. limit=15.0
2023-06-20 23:05:09,609 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=12.0
2023-06-20 23:05:45,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=661884.0, ans=0.125
2023-06-20 23:05:45,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 2.344e+02 2.808e+02 3.404e+02 5.687e+02, threshold=5.616e+02, percent-clipped=4.0
2023-06-20 23:06:13,767 INFO [train.py:996] (0/4) Epoch 4, batch 18850, loss[loss=0.2315, simple_loss=0.2951, pruned_loss=0.08393, over 21816.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2969, pruned_loss=0.07645, over 4240560.86 frames. ], batch size: 102, lr: 7.84e-03, grad_scale: 32.0
2023-06-20 23:06:53,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=662124.0, ans=0.0
2023-06-20 23:06:53,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=662124.0, ans=0.0
2023-06-20 23:07:38,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=662244.0, ans=0.125
2023-06-20 23:07:51,150 INFO [train.py:996] (0/4) Epoch 4, batch 18900, loss[loss=0.2304, simple_loss=0.2828, pruned_loss=0.08898, over 21455.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2932, pruned_loss=0.07651, over 4238711.36 frames. ], batch size: 194, lr: 7.84e-03, grad_scale: 32.0
2023-06-20 23:07:59,179 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 23:08:07,587 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 23:08:48,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.04 vs. limit=6.0
2023-06-20 23:08:48,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 2.224e+02 2.639e+02 3.131e+02 5.534e+02, threshold=5.278e+02, percent-clipped=0.0
2023-06-20 23:08:52,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=662484.0, ans=0.125
2023-06-20 23:09:28,921 INFO [train.py:996] (0/4) Epoch 4, batch 18950, loss[loss=0.285, simple_loss=0.3822, pruned_loss=0.09392, over 21769.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2953, pruned_loss=0.07877, over 4247949.53 frames. ], batch size: 415, lr: 7.83e-03, grad_scale: 32.0
2023-06-20 23:09:47,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=12.0
2023-06-20 23:10:45,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=662784.0, ans=0.1
2023-06-20 23:11:22,823 INFO [train.py:996] (0/4) Epoch 4, batch 19000, loss[loss=0.2947, simple_loss=0.3631, pruned_loss=0.1132, over 21604.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3052, pruned_loss=0.08122, over 4262042.22 frames. ], batch size: 389, lr: 7.83e-03, grad_scale: 32.0
2023-06-20 23:11:44,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=662964.0, ans=0.0
2023-06-20 23:12:06,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=663024.0, ans=0.1
2023-06-20 23:12:22,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=663084.0, ans=0.0
2023-06-20 23:12:25,114 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.761e+02 3.068e+02 3.870e+02 6.410e+02, threshold=6.137e+02, percent-clipped=5.0
2023-06-20 23:12:33,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=663084.0, ans=0.0
2023-06-20 23:12:46,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=663144.0, ans=0.2
2023-06-20 23:12:54,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=663144.0, ans=0.125
2023-06-20 23:12:59,009 INFO [train.py:996] (0/4) Epoch 4, batch 19050, loss[loss=0.2374, simple_loss=0.3026, pruned_loss=0.08612, over 21881.00 frames. ], tot_loss[loss=0.241, simple_loss=0.311, pruned_loss=0.08548, over 4270991.57 frames. ], batch size: 298, lr: 7.83e-03, grad_scale: 32.0
2023-06-20 23:13:05,923 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.23 vs. limit=10.0
2023-06-20 23:13:58,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=663384.0, ans=0.0
2023-06-20 23:14:04,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=22.5
2023-06-20 23:14:35,607 INFO [train.py:996] (0/4) Epoch 4, batch 19100, loss[loss=0.2459, simple_loss=0.3024, pruned_loss=0.0947, over 20008.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3093, pruned_loss=0.08643, over 4278319.81 frames. ], batch size: 702, lr: 7.83e-03, grad_scale: 32.0
2023-06-20 23:15:00,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=663564.0, ans=0.1
2023-06-20 23:15:17,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=663624.0, ans=0.125
2023-06-20 23:15:28,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0
2023-06-20 23:15:39,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=663624.0, ans=0.0
2023-06-20 23:15:45,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0
2023-06-20 23:15:46,363 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.597e+02 2.867e+02 3.510e+02 4.906e+02, threshold=5.733e+02, percent-clipped=0.0
2023-06-20 23:16:19,235 INFO [train.py:996] (0/4) Epoch 4, batch 19150, loss[loss=0.247, simple_loss=0.3111, pruned_loss=0.09144, over 20898.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3116, pruned_loss=0.08687, over 4270991.37 frames. ], batch size: 608, lr: 7.83e-03, grad_scale: 16.0
2023-06-20 23:16:25,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=663804.0, ans=0.0
2023-06-20 23:16:49,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=663864.0, ans=0.0
2023-06-20 23:17:29,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=663924.0, ans=0.0
2023-06-20 23:18:08,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=664104.0, ans=0.0
2023-06-20 23:18:09,658 INFO [train.py:996] (0/4) Epoch 4, batch 19200, loss[loss=0.2407, simple_loss=0.3366, pruned_loss=0.07241, over 21376.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3218, pruned_loss=0.08823, over 4274170.66 frames. ], batch size: 211, lr: 7.82e-03, grad_scale: 32.0
2023-06-20 23:18:30,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=664104.0, ans=0.125
2023-06-20 23:19:07,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=664224.0, ans=0.0
2023-06-20 23:19:19,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 2.481e+02 2.914e+02 3.556e+02 7.085e+02, threshold=5.828e+02, percent-clipped=2.0
2023-06-20 23:19:33,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0
2023-06-20 23:19:45,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=664344.0, ans=0.2
2023-06-20 23:19:54,527 INFO [train.py:996] (0/4) Epoch 4, batch 19250, loss[loss=0.1714, simple_loss=0.262, pruned_loss=0.04042, over 21483.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3223, pruned_loss=0.08367, over 4272634.03 frames. ], batch size: 211, lr: 7.82e-03, grad_scale: 32.0
2023-06-20 23:20:18,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=664404.0, ans=0.125
2023-06-20 23:20:18,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=664404.0, ans=0.0
2023-06-20 23:20:40,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=664464.0, ans=0.2
2023-06-20 23:20:56,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=664524.0, ans=0.125
2023-06-20 23:21:14,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=664584.0, ans=15.0
2023-06-20 23:21:42,202 INFO [train.py:996] (0/4) Epoch 4, batch 19300, loss[loss=0.2243, simple_loss=0.3005, pruned_loss=0.07405, over 21449.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3189, pruned_loss=0.0823, over 4277689.10 frames. ], batch size: 131, lr: 7.82e-03, grad_scale: 32.0
2023-06-20 23:21:45,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=664704.0, ans=0.2
2023-06-20 23:21:58,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=664704.0, ans=0.1
2023-06-20 23:22:12,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=664764.0, ans=0.2
2023-06-20 23:22:36,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0
2023-06-20 23:22:49,105 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 2.366e+02 2.667e+02 3.289e+02 5.476e+02, threshold=5.333e+02, percent-clipped=0.0
2023-06-20 23:23:13,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=664944.0, ans=0.1
2023-06-20 23:23:22,586 INFO [train.py:996] (0/4) Epoch 4, batch 19350, loss[loss=0.1943, simple_loss=0.2788, pruned_loss=0.0549, over 21591.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.312, pruned_loss=0.07777, over 4276592.93 frames. ], batch size: 230, lr: 7.82e-03, grad_scale: 32.0
2023-06-20 23:23:27,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=665004.0, ans=0.95
2023-06-20 23:23:57,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0
2023-06-20 23:24:04,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=665124.0, ans=0.0
2023-06-20 23:24:47,892 INFO [train.py:996] (0/4) Epoch 4, batch 19400, loss[loss=0.2832, simple_loss=0.3398, pruned_loss=0.1133, over 21734.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.309, pruned_loss=0.07661, over 4272274.58 frames. ], batch size: 473, lr: 7.82e-03, grad_scale: 32.0
2023-06-20 23:25:14,396 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=22.5
2023-06-20 23:25:57,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 2.319e+02 2.845e+02 3.666e+02 6.009e+02, threshold=5.690e+02, percent-clipped=2.0
2023-06-20 23:26:29,721 INFO [train.py:996] (0/4) Epoch 4, batch 19450, loss[loss=0.213, simple_loss=0.2724, pruned_loss=0.07684, over 21505.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3075, pruned_loss=0.07957, over 4280846.94 frames. ], batch size: 212, lr: 7.82e-03, grad_scale: 32.0
2023-06-20 23:26:56,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=665664.0, ans=0.125
2023-06-20 23:27:01,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.93 vs. limit=22.5
2023-06-20 23:27:07,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=665724.0, ans=0.125
2023-06-20 23:27:25,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=665784.0, ans=0.0
2023-06-20 23:27:27,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=665784.0, ans=0.125
2023-06-20 23:28:18,734 INFO [train.py:996] (0/4) Epoch 4, batch 19500, loss[loss=0.223, simple_loss=0.2913, pruned_loss=0.07728, over 21793.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3026, pruned_loss=0.08023, over 4285791.87 frames. ], batch size: 352, lr: 7.81e-03, grad_scale: 32.0
2023-06-20 23:28:20,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=665904.0, ans=0.0
2023-06-20 23:28:35,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=665904.0, ans=0.0
2023-06-20 23:29:10,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=666024.0, ans=10.0
2023-06-20 23:29:13,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=666024.0, ans=0.04949747468305833
2023-06-20 23:29:19,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=666084.0, ans=0.2
2023-06-20 23:29:20,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.739e+02 3.335e+02 3.880e+02 8.921e+02, threshold=6.671e+02, percent-clipped=3.0
2023-06-20 23:29:30,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=666144.0, ans=12.0
2023-06-20 23:29:31,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=666144.0, ans=0.0
2023-06-20 23:29:56,779 INFO [train.py:996] (0/4) Epoch 4, batch 19550, loss[loss=0.1883, simple_loss=0.2708, pruned_loss=0.05292, over 21381.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2988, pruned_loss=0.07909, over 4275065.04 frames. ], batch size: 194, lr: 7.81e-03, grad_scale: 16.0
2023-06-20 23:30:13,673 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0
2023-06-20 23:30:19,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.37 vs. limit=22.5
2023-06-20 23:31:38,469 INFO [train.py:996] (0/4) Epoch 4, batch 19600, loss[loss=0.2561, simple_loss=0.3194, pruned_loss=0.09638, over 21870.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3023, pruned_loss=0.07993, over 4273299.13 frames. ], batch size: 371, lr: 7.81e-03, grad_scale: 32.0
2023-06-20 23:31:40,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=666504.0, ans=0.0
2023-06-20 23:31:44,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=666504.0, ans=0.2
2023-06-20 23:32:03,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=666564.0, ans=0.0
2023-06-20 23:32:21,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.32 vs. limit=10.0
2023-06-20 23:32:32,597 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.820e+02 2.560e+02 2.903e+02 3.359e+02 5.140e+02, threshold=5.805e+02, percent-clipped=0.0
2023-06-20 23:32:37,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=666684.0, ans=0.125
2023-06-20 23:33:13,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=666804.0, ans=0.1
2023-06-20 23:33:14,396 INFO [train.py:996] (0/4) Epoch 4, batch 19650, loss[loss=0.2331, simple_loss=0.3047, pruned_loss=0.08077, over 21775.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3074, pruned_loss=0.08355, over 4278466.08 frames. ], batch size: 298, lr: 7.81e-03, grad_scale: 32.0
2023-06-20 23:33:30,836 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.50 vs. limit=15.0
2023-06-20 23:34:07,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=666984.0, ans=0.2
2023-06-20 23:34:52,420 INFO [train.py:996] (0/4) Epoch 4, batch 19700, loss[loss=0.2217, simple_loss=0.2926, pruned_loss=0.07544, over 21494.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3103, pruned_loss=0.08459, over 4274506.76 frames. ], batch size: 195, lr: 7.81e-03, grad_scale: 32.0
2023-06-20 23:34:57,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=667104.0, ans=0.125
2023-06-20 23:35:39,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=667164.0, ans=0.125
2023-06-20 23:35:48,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=667224.0, ans=0.2
2023-06-20 23:36:21,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.712e+02 3.222e+02 3.780e+02 5.761e+02, threshold=6.445e+02, percent-clipped=0.0
2023-06-20 23:36:27,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=667284.0, ans=0.125
2023-06-20 23:36:30,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=667284.0, ans=0.0
2023-06-20 23:36:33,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=667284.0, ans=0.2
2023-06-20 23:36:39,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=667344.0, ans=0.125
2023-06-20 23:36:52,636 INFO [train.py:996] (0/4) Epoch 4, batch 19750, loss[loss=0.3021, simple_loss=0.3751, pruned_loss=0.1145, over 21786.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.322, pruned_loss=0.08755, over 4279293.21 frames. ], batch size: 414, lr: 7.81e-03, grad_scale: 32.0
2023-06-20 23:37:04,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=667404.0, ans=0.125
2023-06-20 23:37:13,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=667464.0, ans=0.125
2023-06-20 23:37:46,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=667524.0, ans=0.0
2023-06-20 23:38:10,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=667584.0, ans=0.1
2023-06-20 23:38:20,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=667644.0, ans=0.125
2023-06-20 23:38:22,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=667644.0, ans=0.025
2023-06-20 23:38:24,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=667644.0, ans=0.125
2023-06-20 23:38:32,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. limit=10.0
2023-06-20 23:38:39,847 INFO [train.py:996] (0/4) Epoch 4, batch 19800, loss[loss=0.2068, simple_loss=0.2883, pruned_loss=0.06264, over 21827.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3205, pruned_loss=0.08733, over 4287011.83 frames.
], batch size: 351, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:38:57,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=667764.0, ans=0.1 2023-06-20 23:40:03,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.339e+02 2.763e+02 3.291e+02 5.461e+02, threshold=5.526e+02, percent-clipped=0.0 2023-06-20 23:40:10,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-20 23:40:34,091 INFO [train.py:996] (0/4) Epoch 4, batch 19850, loss[loss=0.1901, simple_loss=0.2592, pruned_loss=0.06047, over 21785.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3099, pruned_loss=0.08055, over 4289848.99 frames. ], batch size: 112, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:40:36,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=668004.0, ans=0.0 2023-06-20 23:41:24,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=668124.0, ans=0.125 2023-06-20 23:41:40,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=668184.0, ans=0.125 2023-06-20 23:42:06,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=668244.0, ans=0.125 2023-06-20 23:42:11,836 INFO [train.py:996] (0/4) Epoch 4, batch 19900, loss[loss=0.1858, simple_loss=0.2609, pruned_loss=0.05538, over 21218.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3088, pruned_loss=0.07796, over 4292637.32 frames. ], batch size: 548, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:42:37,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=668364.0, ans=0.0 2023-06-20 23:43:18,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.481e+02 3.021e+02 3.710e+02 5.505e+02, threshold=6.042e+02, percent-clipped=0.0 2023-06-20 23:43:43,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0 2023-06-20 23:43:49,253 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-20 23:43:49,812 INFO [train.py:996] (0/4) Epoch 4, batch 19950, loss[loss=0.2127, simple_loss=0.2817, pruned_loss=0.07189, over 21604.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3025, pruned_loss=0.0772, over 4281099.54 frames. 
], batch size: 332, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:44:43,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=668724.0, ans=0.0 2023-06-20 23:44:47,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=668724.0, ans=0.125 2023-06-20 23:45:36,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=668844.0, ans=0.125 2023-06-20 23:45:38,562 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.57 vs. limit=6.0 2023-06-20 23:45:38,930 INFO [train.py:996] (0/4) Epoch 4, batch 20000, loss[loss=0.222, simple_loss=0.2968, pruned_loss=0.0736, over 21487.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.305, pruned_loss=0.07821, over 4278315.74 frames. ], batch size: 194, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 23:46:48,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=15.0 2023-06-20 23:46:52,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.484e+02 2.763e+02 3.251e+02 5.110e+02, threshold=5.527e+02, percent-clipped=0.0 2023-06-20 23:47:14,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=669144.0, ans=0.125 2023-06-20 23:47:22,970 INFO [train.py:996] (0/4) Epoch 4, batch 20050, loss[loss=0.2469, simple_loss=0.3145, pruned_loss=0.08969, over 21843.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3067, pruned_loss=0.08076, over 4282919.66 frames. ], batch size: 124, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:47:49,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=669264.0, ans=0.125 2023-06-20 23:48:03,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=669324.0, ans=0.125 2023-06-20 23:48:32,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=669384.0, ans=0.125 2023-06-20 23:48:54,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.17 vs. limit=15.0 2023-06-20 23:49:06,480 INFO [train.py:996] (0/4) Epoch 4, batch 20100, loss[loss=0.2493, simple_loss=0.3164, pruned_loss=0.09109, over 21883.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3099, pruned_loss=0.08387, over 4289714.19 frames. ], batch size: 107, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:49:07,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=669504.0, ans=0.0 2023-06-20 23:50:12,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.516e+02 2.838e+02 3.321e+02 5.094e+02, threshold=5.677e+02, percent-clipped=0.0 2023-06-20 23:50:25,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.09 vs. limit=15.0 2023-06-20 23:50:44,844 INFO [train.py:996] (0/4) Epoch 4, batch 20150, loss[loss=0.272, simple_loss=0.344, pruned_loss=0.1, over 21738.00 frames. 
], tot_loss[loss=0.2482, simple_loss=0.3208, pruned_loss=0.08778, over 4290688.45 frames. ], batch size: 332, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:51:37,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=669864.0, ans=0.0 2023-06-20 23:52:05,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=669924.0, ans=0.125 2023-06-20 23:52:07,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=669984.0, ans=0.125 2023-06-20 23:52:57,211 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-20 23:52:59,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-20 23:53:08,049 INFO [train.py:996] (0/4) Epoch 4, batch 20200, loss[loss=0.2604, simple_loss=0.3555, pruned_loss=0.08261, over 21834.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3249, pruned_loss=0.09016, over 4269474.08 frames. ], batch size: 316, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:53:51,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=670224.0, ans=0.125 2023-06-20 23:53:54,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=670224.0, ans=0.0 2023-06-20 23:54:00,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=670224.0, ans=0.09899494936611666 2023-06-20 23:54:20,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.584e+02 3.030e+02 3.747e+02 5.271e+02, threshold=6.060e+02, percent-clipped=0.0 2023-06-20 23:54:38,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.64 vs. limit=10.0 2023-06-20 23:54:45,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=670344.0, ans=0.0 2023-06-20 23:54:51,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=670344.0, ans=0.125 2023-06-20 23:54:54,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=670344.0, ans=0.2 2023-06-20 23:55:04,490 INFO [train.py:996] (0/4) Epoch 4, batch 20250, loss[loss=0.2191, simple_loss=0.2867, pruned_loss=0.07575, over 21421.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3252, pruned_loss=0.08835, over 4268450.99 frames. 
], batch size: 176, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:55:10,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=670404.0, ans=0.1 2023-06-20 23:55:39,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=670524.0, ans=0.0 2023-06-20 23:56:35,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=670644.0, ans=0.0 2023-06-20 23:56:58,701 INFO [train.py:996] (0/4) Epoch 4, batch 20300, loss[loss=0.2142, simple_loss=0.2896, pruned_loss=0.06936, over 21342.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3239, pruned_loss=0.08586, over 4269166.66 frames. ], batch size: 176, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 23:57:03,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.07 vs. limit=22.5 2023-06-20 23:57:53,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.338e+02 2.634e+02 2.984e+02 5.802e+02, threshold=5.268e+02, percent-clipped=0.0 2023-06-20 23:58:29,867 INFO [train.py:996] (0/4) Epoch 4, batch 20350, loss[loss=0.3279, simple_loss=0.3736, pruned_loss=0.1411, over 21545.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3232, pruned_loss=0.08594, over 4264379.25 frames. ], batch size: 507, lr: 7.78e-03, grad_scale: 32.0 2023-06-20 23:59:08,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=671124.0, ans=0.125 2023-06-20 23:59:15,206 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:59:58,418 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-21 00:00:05,656 INFO [train.py:996] (0/4) Epoch 4, batch 20400, loss[loss=0.203, simple_loss=0.2709, pruned_loss=0.06754, over 16410.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3267, pruned_loss=0.08931, over 4259953.01 frames. 
], batch size: 61, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:00:14,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=671304.0, ans=0.2 2023-06-21 00:00:20,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=671304.0, ans=0.125 2023-06-21 00:00:33,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=671364.0, ans=0.125 2023-06-21 00:00:33,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=671364.0, ans=0.0 2023-06-21 00:00:58,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=671424.0, ans=0.09899494936611666 2023-06-21 00:01:00,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=671484.0, ans=0.125 2023-06-21 00:01:05,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.821e+02 3.170e+02 3.737e+02 5.615e+02, threshold=6.339e+02, percent-clipped=2.0 2023-06-21 00:01:07,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-21 00:01:22,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=671544.0, ans=0.125 2023-06-21 00:01:31,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=671544.0, ans=0.125 2023-06-21 00:01:32,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=671544.0, ans=0.2 2023-06-21 00:01:47,139 INFO [train.py:996] (0/4) Epoch 4, batch 20450, loss[loss=0.2582, simple_loss=0.3236, pruned_loss=0.09639, over 21374.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3282, pruned_loss=0.09164, over 4255107.43 frames. ], batch size: 548, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:02:03,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=671664.0, ans=0.125 2023-06-21 00:02:12,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=671664.0, ans=0.125 2023-06-21 00:02:40,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=671784.0, ans=0.125 2023-06-21 00:02:45,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671784.0, ans=0.1 2023-06-21 00:03:16,205 INFO [train.py:996] (0/4) Epoch 4, batch 20500, loss[loss=0.2518, simple_loss=0.3087, pruned_loss=0.09741, over 21841.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3237, pruned_loss=0.09154, over 4253036.67 frames. 
], batch size: 124, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:03:23,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=671904.0, ans=0.2 2023-06-21 00:03:24,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=671904.0, ans=0.04949747468305833 2023-06-21 00:03:29,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=671964.0, ans=0.125 2023-06-21 00:03:38,255 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-112000.pt 2023-06-21 00:03:46,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=671964.0, ans=0.0 2023-06-21 00:03:53,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=672024.0, ans=0.125 2023-06-21 00:04:05,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=672024.0, ans=0.125 2023-06-21 00:04:14,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.579e+02 2.936e+02 3.494e+02 5.643e+02, threshold=5.872e+02, percent-clipped=0.0 2023-06-21 00:04:18,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=672084.0, ans=0.0 2023-06-21 00:04:55,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=672204.0, ans=0.125 2023-06-21 00:04:56,151 INFO [train.py:996] (0/4) Epoch 4, batch 20550, loss[loss=0.3115, simple_loss=0.3723, pruned_loss=0.1253, over 21473.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3154, pruned_loss=0.08915, over 4250316.80 frames. ], batch size: 509, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:05:10,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=672264.0, ans=0.125 2023-06-21 00:05:36,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=672324.0, ans=0.125 2023-06-21 00:05:40,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=672324.0, ans=0.125 2023-06-21 00:06:15,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=672444.0, ans=0.0 2023-06-21 00:06:15,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=672444.0, ans=0.1 2023-06-21 00:06:20,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=672444.0, ans=0.1 2023-06-21 00:06:37,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0 2023-06-21 00:06:42,189 INFO [train.py:996] (0/4) Epoch 4, batch 20600, loss[loss=0.2756, simple_loss=0.3464, pruned_loss=0.1024, over 21875.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3174, pruned_loss=0.08703, over 4241555.68 frames. 
], batch size: 118, lr: 7.78e-03, grad_scale: 32.0 2023-06-21 00:07:02,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=672564.0, ans=0.1 2023-06-21 00:07:06,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=672564.0, ans=0.0 2023-06-21 00:07:25,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=672624.0, ans=0.125 2023-06-21 00:07:42,108 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.517e+02 2.771e+02 3.210e+02 6.753e+02, threshold=5.541e+02, percent-clipped=2.0 2023-06-21 00:07:48,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=672684.0, ans=0.125 2023-06-21 00:08:18,035 INFO [train.py:996] (0/4) Epoch 4, batch 20650, loss[loss=0.2372, simple_loss=0.3017, pruned_loss=0.08631, over 21612.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3136, pruned_loss=0.08694, over 4247025.58 frames. ], batch size: 391, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:08:21,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=672804.0, ans=0.0 2023-06-21 00:08:22,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=672804.0, ans=0.035 2023-06-21 00:08:28,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-21 00:09:10,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0 2023-06-21 00:09:34,744 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-21 00:09:56,824 INFO [train.py:996] (0/4) Epoch 4, batch 20700, loss[loss=0.2271, simple_loss=0.2864, pruned_loss=0.08387, over 21421.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3061, pruned_loss=0.08337, over 4243785.91 frames. ], batch size: 473, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:10:20,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=673164.0, ans=0.0 2023-06-21 00:11:07,545 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.300e+02 2.582e+02 3.075e+02 5.238e+02, threshold=5.163e+02, percent-clipped=0.0 2023-06-21 00:11:40,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=673344.0, ans=0.2 2023-06-21 00:11:44,522 INFO [train.py:996] (0/4) Epoch 4, batch 20750, loss[loss=0.2837, simple_loss=0.4006, pruned_loss=0.08337, over 20782.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3086, pruned_loss=0.08235, over 4248563.91 frames. ], batch size: 607, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:11:52,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=673404.0, ans=0.0 2023-06-21 00:11:52,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. 
limit=15.0 2023-06-21 00:11:54,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=673404.0, ans=0.2 2023-06-21 00:13:20,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=673644.0, ans=0.2 2023-06-21 00:13:28,254 INFO [train.py:996] (0/4) Epoch 4, batch 20800, loss[loss=0.2204, simple_loss=0.2808, pruned_loss=0.07994, over 21864.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3118, pruned_loss=0.08375, over 4254938.55 frames. ], batch size: 107, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:13:47,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=673764.0, ans=0.035 2023-06-21 00:14:11,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-21 00:14:14,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=8.0 2023-06-21 00:14:39,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 2.643e+02 3.014e+02 3.715e+02 5.359e+02, threshold=6.029e+02, percent-clipped=3.0 2023-06-21 00:14:41,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=673884.0, ans=0.0 2023-06-21 00:14:41,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=673884.0, ans=0.125 2023-06-21 00:14:42,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=673884.0, ans=0.125 2023-06-21 00:14:45,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=673884.0, ans=0.1 2023-06-21 00:14:53,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=673944.0, ans=0.125 2023-06-21 00:15:03,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=674004.0, ans=0.125 2023-06-21 00:15:04,246 INFO [train.py:996] (0/4) Epoch 4, batch 20850, loss[loss=0.2485, simple_loss=0.3051, pruned_loss=0.0959, over 21313.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3043, pruned_loss=0.08154, over 4260657.48 frames. ], batch size: 143, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:15:06,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=674004.0, ans=0.125 2023-06-21 00:15:28,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=674004.0, ans=0.125 2023-06-21 00:16:58,060 INFO [train.py:996] (0/4) Epoch 4, batch 20900, loss[loss=0.2194, simple_loss=0.298, pruned_loss=0.07039, over 21588.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3048, pruned_loss=0.08263, over 4271274.36 frames. 
], batch size: 230, lr: 7.77e-03, grad_scale: 32.0 2023-06-21 00:17:17,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=674364.0, ans=0.125 2023-06-21 00:17:20,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=674364.0, ans=0.1 2023-06-21 00:17:58,039 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 2.368e+02 2.710e+02 3.422e+02 5.410e+02, threshold=5.420e+02, percent-clipped=0.0 2023-06-21 00:18:26,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=674544.0, ans=0.0 2023-06-21 00:18:33,408 INFO [train.py:996] (0/4) Epoch 4, batch 20950, loss[loss=0.2344, simple_loss=0.3001, pruned_loss=0.08433, over 21836.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3023, pruned_loss=0.07955, over 4260643.13 frames. ], batch size: 124, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:18:35,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-21 00:19:32,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=674784.0, ans=0.2 2023-06-21 00:20:07,637 INFO [train.py:996] (0/4) Epoch 4, batch 21000, loss[loss=0.2308, simple_loss=0.2941, pruned_loss=0.08375, over 21685.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3021, pruned_loss=0.08026, over 4265265.56 frames. ], batch size: 263, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:20:07,639 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 00:20:59,674 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2681, simple_loss=0.367, pruned_loss=0.0846, over 1796401.00 frames. 2023-06-21 00:20:59,676 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-21 00:21:29,006 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=22.5 2023-06-21 00:21:47,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=675024.0, ans=0.07 2023-06-21 00:21:59,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=675084.0, ans=0.1 2023-06-21 00:22:00,271 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.300e+02 2.574e+02 2.981e+02 4.103e+02, threshold=5.148e+02, percent-clipped=0.0 2023-06-21 00:22:35,780 INFO [train.py:996] (0/4) Epoch 4, batch 21050, loss[loss=0.2268, simple_loss=0.2831, pruned_loss=0.08523, over 21222.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2997, pruned_loss=0.08018, over 4255704.63 frames. ], batch size: 608, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:22:38,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. 
limit=15.0 2023-06-21 00:23:13,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=675264.0, ans=0.125 2023-06-21 00:23:15,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=675324.0, ans=0.0 2023-06-21 00:23:32,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=675384.0, ans=0.2 2023-06-21 00:23:40,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-21 00:23:47,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=675384.0, ans=0.125 2023-06-21 00:24:14,090 INFO [train.py:996] (0/4) Epoch 4, batch 21100, loss[loss=0.2154, simple_loss=0.2797, pruned_loss=0.07555, over 21469.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.296, pruned_loss=0.07936, over 4265869.43 frames. ], batch size: 132, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:24:55,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=675564.0, ans=0.1 2023-06-21 00:25:01,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2023-06-21 00:25:19,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=675624.0, ans=0.0 2023-06-21 00:25:22,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.30 vs. limit=15.0 2023-06-21 00:25:28,114 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.453e+02 2.744e+02 3.185e+02 4.554e+02, threshold=5.489e+02, percent-clipped=0.0 2023-06-21 00:25:59,866 INFO [train.py:996] (0/4) Epoch 4, batch 21150, loss[loss=0.2338, simple_loss=0.2832, pruned_loss=0.09217, over 21312.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2935, pruned_loss=0.08021, over 4257609.11 frames. ], batch size: 473, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:26:18,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=675864.0, ans=0.0 2023-06-21 00:26:57,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=675924.0, ans=0.0 2023-06-21 00:27:02,455 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:27:33,686 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.74 vs. limit=10.0 2023-06-21 00:27:40,832 INFO [train.py:996] (0/4) Epoch 4, batch 21200, loss[loss=0.2136, simple_loss=0.2758, pruned_loss=0.07563, over 21766.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2896, pruned_loss=0.07952, over 4262328.97 frames. 
], batch size: 316, lr: 7.76e-03, grad_scale: 32.0 2023-06-21 00:28:18,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=676224.0, ans=0.0 2023-06-21 00:28:40,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=676284.0, ans=0.035 2023-06-21 00:28:41,444 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.500e+02 2.845e+02 3.451e+02 6.035e+02, threshold=5.690e+02, percent-clipped=1.0 2023-06-21 00:28:59,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.94 vs. limit=22.5 2023-06-21 00:29:18,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=676404.0, ans=0.0 2023-06-21 00:29:19,082 INFO [train.py:996] (0/4) Epoch 4, batch 21250, loss[loss=0.2437, simple_loss=0.3362, pruned_loss=0.07559, over 19708.00 frames. ], tot_loss[loss=0.223, simple_loss=0.288, pruned_loss=0.07903, over 4263809.54 frames. ], batch size: 702, lr: 7.75e-03, grad_scale: 32.0 2023-06-21 00:30:41,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=676644.0, ans=0.1 2023-06-21 00:30:53,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2023-06-21 00:30:55,550 INFO [train.py:996] (0/4) Epoch 4, batch 21300, loss[loss=0.2654, simple_loss=0.3562, pruned_loss=0.08728, over 20771.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2946, pruned_loss=0.08101, over 4264856.41 frames. ], batch size: 608, lr: 7.75e-03, grad_scale: 32.0 2023-06-21 00:30:59,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=676704.0, ans=0.125 2023-06-21 00:32:01,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.763e+02 3.058e+02 3.413e+02 5.762e+02, threshold=6.115e+02, percent-clipped=1.0 2023-06-21 00:32:31,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=677004.0, ans=0.1 2023-06-21 00:32:32,305 INFO [train.py:996] (0/4) Epoch 4, batch 21350, loss[loss=0.2567, simple_loss=0.3193, pruned_loss=0.09705, over 21353.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.2991, pruned_loss=0.08214, over 4272554.89 frames. ], batch size: 159, lr: 7.75e-03, grad_scale: 16.0 2023-06-21 00:32:49,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=677004.0, ans=0.0 2023-06-21 00:33:06,640 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.11 vs. limit=8.0 2023-06-21 00:33:31,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=677184.0, ans=0.125 2023-06-21 00:33:55,539 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-21 00:34:07,674 INFO [train.py:996] (0/4) Epoch 4, batch 21400, loss[loss=0.1976, simple_loss=0.2847, pruned_loss=0.05525, over 21424.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3013, pruned_loss=0.08158, over 4277516.11 frames. 
], batch size: 211, lr: 7.75e-03, grad_scale: 16.0 2023-06-21 00:34:12,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=677304.0, ans=0.125 2023-06-21 00:34:32,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=677364.0, ans=0.125 2023-06-21 00:35:27,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=677484.0, ans=0.1 2023-06-21 00:35:29,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=677484.0, ans=0.2 2023-06-21 00:35:34,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.398e+02 2.726e+02 3.426e+02 6.163e+02, threshold=5.451e+02, percent-clipped=1.0 2023-06-21 00:35:36,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=677484.0, ans=0.0 2023-06-21 00:35:37,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=677484.0, ans=15.0 2023-06-21 00:36:02,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=677604.0, ans=0.0 2023-06-21 00:36:03,775 INFO [train.py:996] (0/4) Epoch 4, batch 21450, loss[loss=0.2866, simple_loss=0.3373, pruned_loss=0.118, over 21607.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3054, pruned_loss=0.08325, over 4286679.29 frames. ], batch size: 471, lr: 7.75e-03, grad_scale: 16.0 2023-06-21 00:36:17,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=677664.0, ans=0.125 2023-06-21 00:37:13,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=677784.0, ans=0.2 2023-06-21 00:37:45,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=677904.0, ans=0.125 2023-06-21 00:37:46,119 INFO [train.py:996] (0/4) Epoch 4, batch 21500, loss[loss=0.2319, simple_loss=0.3062, pruned_loss=0.07877, over 20846.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3049, pruned_loss=0.0845, over 4270326.95 frames. ], batch size: 607, lr: 7.74e-03, grad_scale: 16.0 2023-06-21 00:38:05,998 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-21 00:38:26,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=677964.0, ans=0.2 2023-06-21 00:38:29,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=677964.0, ans=0.2 2023-06-21 00:39:13,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.721e+02 2.885e+02 3.448e+02 4.361e+02 7.505e+02, threshold=6.896e+02, percent-clipped=8.0 2023-06-21 00:39:43,568 INFO [train.py:996] (0/4) Epoch 4, batch 21550, loss[loss=0.2238, simple_loss=0.277, pruned_loss=0.08534, over 21193.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.2976, pruned_loss=0.0816, over 4253249.42 frames. 
], batch size: 176, lr: 7.74e-03, grad_scale: 16.0 2023-06-21 00:39:46,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-21 00:39:51,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=678204.0, ans=0.125 2023-06-21 00:40:07,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=678264.0, ans=0.0 2023-06-21 00:40:12,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=678264.0, ans=0.125 2023-06-21 00:40:23,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=678324.0, ans=0.125 2023-06-21 00:40:27,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-21 00:40:30,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=678324.0, ans=0.0 2023-06-21 00:40:30,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=678324.0, ans=0.1 2023-06-21 00:40:30,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-21 00:41:11,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-21 00:41:20,999 INFO [train.py:996] (0/4) Epoch 4, batch 21600, loss[loss=0.2214, simple_loss=0.2779, pruned_loss=0.08243, over 21537.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2938, pruned_loss=0.0803, over 4256721.47 frames. ], batch size: 442, lr: 7.74e-03, grad_scale: 32.0 2023-06-21 00:41:22,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-21 00:41:55,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=678564.0, ans=0.125 2023-06-21 00:42:46,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=678684.0, ans=0.1 2023-06-21 00:42:48,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.277e+02 2.634e+02 3.164e+02 4.314e+02, threshold=5.268e+02, percent-clipped=0.0 2023-06-21 00:42:55,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=678684.0, ans=0.1 2023-06-21 00:43:12,688 INFO [train.py:996] (0/4) Epoch 4, batch 21650, loss[loss=0.2088, simple_loss=0.288, pruned_loss=0.06479, over 21810.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2962, pruned_loss=0.07813, over 4250181.19 frames. 
], batch size: 102, lr: 7.74e-03, grad_scale: 32.0 2023-06-21 00:43:36,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=678864.0, ans=0.125 2023-06-21 00:44:15,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=678924.0, ans=0.07 2023-06-21 00:44:22,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=678984.0, ans=0.07 2023-06-21 00:44:28,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=678984.0, ans=0.0 2023-06-21 00:45:01,846 INFO [train.py:996] (0/4) Epoch 4, batch 21700, loss[loss=0.2039, simple_loss=0.2967, pruned_loss=0.05557, over 21749.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2956, pruned_loss=0.0762, over 4262157.96 frames. ], batch size: 298, lr: 7.74e-03, grad_scale: 32.0 2023-06-21 00:45:54,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=679224.0, ans=0.0 2023-06-21 00:46:23,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=679284.0, ans=0.125 2023-06-21 00:46:24,574 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 2.270e+02 2.667e+02 3.310e+02 7.431e+02, threshold=5.334e+02, percent-clipped=8.0 2023-06-21 00:46:46,220 INFO [train.py:996] (0/4) Epoch 4, batch 21750, loss[loss=0.24, simple_loss=0.2904, pruned_loss=0.09482, over 21569.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2921, pruned_loss=0.07602, over 4265161.09 frames. ], batch size: 415, lr: 7.74e-03, grad_scale: 16.0 2023-06-21 00:47:03,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=679404.0, ans=0.125 2023-06-21 00:48:44,938 INFO [train.py:996] (0/4) Epoch 4, batch 21800, loss[loss=0.2111, simple_loss=0.2749, pruned_loss=0.07361, over 21636.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2897, pruned_loss=0.0768, over 4261606.05 frames. ], batch size: 264, lr: 7.73e-03, grad_scale: 16.0 2023-06-21 00:48:45,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=679704.0, ans=0.125 2023-06-21 00:49:56,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=679884.0, ans=0.1 2023-06-21 00:50:04,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.568e+02 2.919e+02 3.829e+02 7.614e+02, threshold=5.838e+02, percent-clipped=7.0 2023-06-21 00:50:35,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=680004.0, ans=0.125 2023-06-21 00:50:41,349 INFO [train.py:996] (0/4) Epoch 4, batch 21850, loss[loss=0.2981, simple_loss=0.3509, pruned_loss=0.1226, over 21688.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2952, pruned_loss=0.07755, over 4264418.31 frames. 
], batch size: 507, lr: 7.73e-03, grad_scale: 16.0 2023-06-21 00:50:59,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=680004.0, ans=0.125 2023-06-21 00:51:12,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=680064.0, ans=0.0 2023-06-21 00:51:26,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=680064.0, ans=0.125 2023-06-21 00:51:54,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=680124.0, ans=0.2 2023-06-21 00:52:29,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=680244.0, ans=0.125 2023-06-21 00:52:47,289 INFO [train.py:996] (0/4) Epoch 4, batch 21900, loss[loss=0.2083, simple_loss=0.2867, pruned_loss=0.06493, over 19774.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2972, pruned_loss=0.07896, over 4268547.00 frames. ], batch size: 702, lr: 7.73e-03, grad_scale: 16.0 2023-06-21 00:53:02,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=680304.0, ans=0.0 2023-06-21 00:54:01,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.548e+02 2.891e+02 3.393e+02 5.559e+02, threshold=5.783e+02, percent-clipped=0.0 2023-06-21 00:54:06,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=680484.0, ans=0.2 2023-06-21 00:54:23,719 INFO [train.py:996] (0/4) Epoch 4, batch 21950, loss[loss=0.2038, simple_loss=0.2796, pruned_loss=0.06398, over 21526.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2936, pruned_loss=0.07898, over 4268672.88 frames. ], batch size: 441, lr: 7.73e-03, grad_scale: 16.0 2023-06-21 00:54:46,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=680604.0, ans=0.2 2023-06-21 00:55:12,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=680664.0, ans=0.125 2023-06-21 00:55:56,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=22.5 2023-06-21 00:56:13,982 INFO [train.py:996] (0/4) Epoch 4, batch 22000, loss[loss=0.1842, simple_loss=0.2547, pruned_loss=0.05687, over 21438.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2862, pruned_loss=0.07451, over 4269373.95 frames. ], batch size: 195, lr: 7.73e-03, grad_scale: 32.0 2023-06-21 00:56:55,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=680964.0, ans=0.125 2023-06-21 00:57:34,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=681084.0, ans=0.125 2023-06-21 00:57:46,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 2.067e+02 2.288e+02 2.758e+02 6.986e+02, threshold=4.576e+02, percent-clipped=2.0 2023-06-21 00:58:08,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=681144.0, ans=0.125 2023-06-21 00:58:35,068 INFO [train.py:996] (0/4) Epoch 4, batch 22050, loss[loss=0.2655, simple_loss=0.3343, pruned_loss=0.0983, over 21243.00 frames. 
], tot_loss[loss=0.2205, simple_loss=0.2895, pruned_loss=0.07577, over 4263597.52 frames. ], batch size: 143, lr: 7.73e-03, grad_scale: 32.0 2023-06-21 00:58:38,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=681204.0, ans=0.1 2023-06-21 00:58:49,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=681264.0, ans=0.1 2023-06-21 00:58:57,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=681264.0, ans=0.5 2023-06-21 01:00:02,226 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-21 01:00:15,413 INFO [train.py:996] (0/4) Epoch 4, batch 22100, loss[loss=0.2327, simple_loss=0.3015, pruned_loss=0.08201, over 21844.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3022, pruned_loss=0.08142, over 4267623.67 frames. ], batch size: 282, lr: 7.72e-03, grad_scale: 32.0 2023-06-21 01:00:22,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.18 vs. limit=15.0 2023-06-21 01:00:39,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=681564.0, ans=0.125 2023-06-21 01:01:22,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.888e+02 3.349e+02 4.124e+02 5.904e+02, threshold=6.697e+02, percent-clipped=15.0 2023-06-21 01:02:13,231 INFO [train.py:996] (0/4) Epoch 4, batch 22150, loss[loss=0.2363, simple_loss=0.3042, pruned_loss=0.08426, over 20882.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3053, pruned_loss=0.08378, over 4279203.60 frames. ], batch size: 608, lr: 7.72e-03, grad_scale: 32.0 2023-06-21 01:02:15,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=12.0 2023-06-21 01:03:03,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.97 vs. limit=15.0 2023-06-21 01:03:13,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=681984.0, ans=0.09899494936611666 2023-06-21 01:03:27,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-21 01:04:07,383 INFO [train.py:996] (0/4) Epoch 4, batch 22200, loss[loss=0.2421, simple_loss=0.3254, pruned_loss=0.07941, over 21254.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3074, pruned_loss=0.08519, over 4279028.41 frames. ], batch size: 159, lr: 7.72e-03, grad_scale: 16.0 2023-06-21 01:05:17,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.532e+02 2.842e+02 3.263e+02 4.792e+02, threshold=5.684e+02, percent-clipped=0.0 2023-06-21 01:05:36,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=682344.0, ans=0.1 2023-06-21 01:06:03,322 INFO [train.py:996] (0/4) Epoch 4, batch 22250, loss[loss=0.279, simple_loss=0.3502, pruned_loss=0.1039, over 21330.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3149, pruned_loss=0.08617, over 4286496.43 frames. 
], batch size: 143, lr: 7.72e-03, grad_scale: 16.0 2023-06-21 01:06:12,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=682404.0, ans=0.125 2023-06-21 01:07:37,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=682644.0, ans=0.2 2023-06-21 01:07:52,738 INFO [train.py:996] (0/4) Epoch 4, batch 22300, loss[loss=0.2626, simple_loss=0.3348, pruned_loss=0.0952, over 20757.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3175, pruned_loss=0.08864, over 4284973.86 frames. ], batch size: 607, lr: 7.72e-03, grad_scale: 16.0 2023-06-21 01:08:35,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.00 vs. limit=12.0 2023-06-21 01:08:48,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=682764.0, ans=0.0 2023-06-21 01:09:06,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=682884.0, ans=0.0 2023-06-21 01:09:14,367 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-21 01:09:16,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.961e+02 3.295e+02 3.929e+02 6.165e+02, threshold=6.589e+02, percent-clipped=1.0 2023-06-21 01:09:37,996 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-06-21 01:09:57,919 INFO [train.py:996] (0/4) Epoch 4, batch 22350, loss[loss=0.254, simple_loss=0.3068, pruned_loss=0.1006, over 21310.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3167, pruned_loss=0.08975, over 4287196.48 frames. ], batch size: 143, lr: 7.72e-03, grad_scale: 16.0 2023-06-21 01:10:25,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=683004.0, ans=0.1 2023-06-21 01:10:39,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=683064.0, ans=0.125 2023-06-21 01:10:40,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=683064.0, ans=0.125 2023-06-21 01:11:54,222 INFO [train.py:996] (0/4) Epoch 4, batch 22400, loss[loss=0.2211, simple_loss=0.3045, pruned_loss=0.06886, over 21192.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3129, pruned_loss=0.0865, over 4289187.28 frames. ], batch size: 548, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:12:58,952 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 2.385e+02 2.684e+02 3.038e+02 4.758e+02, threshold=5.368e+02, percent-clipped=0.0 2023-06-21 01:12:59,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=683484.0, ans=0.0 2023-06-21 01:13:33,329 INFO [train.py:996] (0/4) Epoch 4, batch 22450, loss[loss=0.2091, simple_loss=0.267, pruned_loss=0.07553, over 21496.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3065, pruned_loss=0.08555, over 4270346.84 frames. 
], batch size: 230, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:14:41,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=683784.0, ans=0.04949747468305833 2023-06-21 01:15:40,179 INFO [train.py:996] (0/4) Epoch 4, batch 22500, loss[loss=0.2403, simple_loss=0.3164, pruned_loss=0.0821, over 21486.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3029, pruned_loss=0.08495, over 4277730.60 frames. ], batch size: 230, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:15:42,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=683904.0, ans=0.0 2023-06-21 01:16:42,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=684024.0, ans=0.0 2023-06-21 01:16:53,265 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.672e+02 3.079e+02 3.545e+02 7.228e+02, threshold=6.157e+02, percent-clipped=7.0 2023-06-21 01:17:23,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=684144.0, ans=0.0 2023-06-21 01:17:43,734 INFO [train.py:996] (0/4) Epoch 4, batch 22550, loss[loss=0.256, simple_loss=0.323, pruned_loss=0.09445, over 21752.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3067, pruned_loss=0.08543, over 4274668.06 frames. ], batch size: 441, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:17:51,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.96 vs. limit=10.0 2023-06-21 01:18:10,734 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-21 01:19:14,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=684444.0, ans=0.0 2023-06-21 01:19:30,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=684444.0, ans=0.0 2023-06-21 01:19:33,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=684504.0, ans=0.1 2023-06-21 01:19:34,958 INFO [train.py:996] (0/4) Epoch 4, batch 22600, loss[loss=0.1784, simple_loss=0.2365, pruned_loss=0.06013, over 21222.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3094, pruned_loss=0.08539, over 4279142.69 frames. ], batch size: 143, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:19:39,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.63 vs. limit=10.0 2023-06-21 01:20:25,409 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:20:55,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.560e+02 2.970e+02 3.484e+02 5.965e+02, threshold=5.939e+02, percent-clipped=0.0 2023-06-21 01:21:05,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=684744.0, ans=0.125 2023-06-21 01:21:31,539 INFO [train.py:996] (0/4) Epoch 4, batch 22650, loss[loss=0.2211, simple_loss=0.2833, pruned_loss=0.07945, over 21738.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3069, pruned_loss=0.08475, over 4277672.99 frames. 
], batch size: 112, lr: 7.71e-03, grad_scale: 32.0 2023-06-21 01:21:47,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=684864.0, ans=0.0 2023-06-21 01:21:50,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=684864.0, ans=0.125 2023-06-21 01:23:06,099 INFO [train.py:996] (0/4) Epoch 4, batch 22700, loss[loss=0.2264, simple_loss=0.2965, pruned_loss=0.07811, over 21819.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3006, pruned_loss=0.08372, over 4277301.76 frames. ], batch size: 102, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:23:29,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.88 vs. limit=22.5 2023-06-21 01:23:34,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.46 vs. limit=15.0 2023-06-21 01:23:40,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=685224.0, ans=0.2 2023-06-21 01:23:50,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=685224.0, ans=0.1 2023-06-21 01:24:16,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.521e+02 2.946e+02 3.485e+02 5.279e+02, threshold=5.893e+02, percent-clipped=0.0 2023-06-21 01:24:42,701 INFO [train.py:996] (0/4) Epoch 4, batch 22750, loss[loss=0.2725, simple_loss=0.3319, pruned_loss=0.1066, over 21762.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3023, pruned_loss=0.08674, over 4277733.17 frames. ], batch size: 332, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:24:51,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=685404.0, ans=0.125 2023-06-21 01:26:18,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=685584.0, ans=0.125 2023-06-21 01:26:43,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-21 01:26:48,310 INFO [train.py:996] (0/4) Epoch 4, batch 22800, loss[loss=0.2267, simple_loss=0.3058, pruned_loss=0.07385, over 21458.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3072, pruned_loss=0.08875, over 4271121.16 frames. ], batch size: 131, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:27:03,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=685764.0, ans=0.0 2023-06-21 01:27:31,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=685824.0, ans=0.5 2023-06-21 01:27:36,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=685824.0, ans=0.1 2023-06-21 01:28:02,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.664e+02 3.134e+02 3.732e+02 6.258e+02, threshold=6.268e+02, percent-clipped=2.0 2023-06-21 01:28:27,903 INFO [train.py:996] (0/4) Epoch 4, batch 22850, loss[loss=0.2189, simple_loss=0.2852, pruned_loss=0.07629, over 21581.00 frames. 
], tot_loss[loss=0.2393, simple_loss=0.3032, pruned_loss=0.08771, over 4280919.47 frames. ], batch size: 263, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:29:04,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=686064.0, ans=0.125 2023-06-21 01:29:04,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-21 01:30:03,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=686184.0, ans=0.0 2023-06-21 01:30:11,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=686184.0, ans=0.0 2023-06-21 01:30:14,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=686184.0, ans=0.125 2023-06-21 01:30:17,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=686244.0, ans=0.0 2023-06-21 01:30:32,288 INFO [train.py:996] (0/4) Epoch 4, batch 22900, loss[loss=0.2741, simple_loss=0.3695, pruned_loss=0.08933, over 21676.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3037, pruned_loss=0.08679, over 4274977.75 frames. ], batch size: 389, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:30:55,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=686364.0, ans=0.125 2023-06-21 01:30:57,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=686364.0, ans=0.125 2023-06-21 01:31:08,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=686364.0, ans=0.2 2023-06-21 01:31:51,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0 2023-06-21 01:31:53,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=686424.0, ans=0.125 2023-06-21 01:32:28,763 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.538e+02 2.990e+02 3.640e+02 6.063e+02, threshold=5.980e+02, percent-clipped=0.0 2023-06-21 01:32:53,721 INFO [train.py:996] (0/4) Epoch 4, batch 22950, loss[loss=0.2413, simple_loss=0.3649, pruned_loss=0.05886, over 21737.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.317, pruned_loss=0.08565, over 4271338.19 frames. ], batch size: 332, lr: 7.70e-03, grad_scale: 32.0 2023-06-21 01:33:14,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=686604.0, ans=0.0 2023-06-21 01:33:16,156 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.37 vs. 
limit=22.5 2023-06-21 01:33:40,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=686664.0, ans=0.125 2023-06-21 01:34:00,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=686724.0, ans=0.125 2023-06-21 01:34:26,295 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:35:05,442 INFO [train.py:996] (0/4) Epoch 4, batch 23000, loss[loss=0.2393, simple_loss=0.3226, pruned_loss=0.07799, over 19942.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3155, pruned_loss=0.08228, over 4270552.31 frames. ], batch size: 703, lr: 7.69e-03, grad_scale: 16.0 2023-06-21 01:35:06,449 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.65 vs. limit=10.0 2023-06-21 01:35:56,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=686964.0, ans=0.125 2023-06-21 01:35:59,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=686964.0, ans=0.0 2023-06-21 01:36:17,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-21 01:36:23,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=687084.0, ans=0.125 2023-06-21 01:36:32,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 2.398e+02 2.749e+02 3.269e+02 6.833e+02, threshold=5.498e+02, percent-clipped=2.0 2023-06-21 01:37:10,577 INFO [train.py:996] (0/4) Epoch 4, batch 23050, loss[loss=0.2451, simple_loss=0.3158, pruned_loss=0.08718, over 21806.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3171, pruned_loss=0.08459, over 4273732.44 frames. ], batch size: 282, lr: 7.69e-03, grad_scale: 16.0 2023-06-21 01:37:22,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=687204.0, ans=0.0 2023-06-21 01:37:53,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=687264.0, ans=0.125 2023-06-21 01:37:53,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=15.0 2023-06-21 01:37:57,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=687324.0, ans=0.125 2023-06-21 01:38:25,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=687384.0, ans=0.0 2023-06-21 01:38:32,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=687384.0, ans=0.1 2023-06-21 01:39:15,338 INFO [train.py:996] (0/4) Epoch 4, batch 23100, loss[loss=0.2015, simple_loss=0.2592, pruned_loss=0.07195, over 21197.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3132, pruned_loss=0.08505, over 4271862.98 frames. 
], batch size: 549, lr: 7.69e-03, grad_scale: 16.0 2023-06-21 01:40:19,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=687624.0, ans=0.125 2023-06-21 01:40:36,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.504e+02 2.792e+02 3.257e+02 4.712e+02, threshold=5.583e+02, percent-clipped=0.0 2023-06-21 01:40:43,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=687684.0, ans=0.125 2023-06-21 01:40:54,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=687744.0, ans=0.1 2023-06-21 01:41:10,867 INFO [train.py:996] (0/4) Epoch 4, batch 23150, loss[loss=0.2468, simple_loss=0.3089, pruned_loss=0.09231, over 21284.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3076, pruned_loss=0.08438, over 4275822.00 frames. ], batch size: 143, lr: 7.69e-03, grad_scale: 16.0 2023-06-21 01:41:11,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=687804.0, ans=0.2 2023-06-21 01:42:17,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5 2023-06-21 01:42:25,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=687924.0, ans=0.2 2023-06-21 01:42:33,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=687984.0, ans=0.0 2023-06-21 01:43:02,826 INFO [train.py:996] (0/4) Epoch 4, batch 23200, loss[loss=0.2298, simple_loss=0.2934, pruned_loss=0.08305, over 21869.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3055, pruned_loss=0.08443, over 4281933.45 frames. ], batch size: 247, lr: 7.69e-03, grad_scale: 32.0 2023-06-21 01:43:23,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=688104.0, ans=0.125 2023-06-21 01:43:47,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=688164.0, ans=0.1 2023-06-21 01:44:27,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.558e+02 3.014e+02 3.330e+02 5.283e+02, threshold=6.028e+02, percent-clipped=0.0 2023-06-21 01:44:32,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=688344.0, ans=0.0 2023-06-21 01:44:48,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0 2023-06-21 01:45:06,671 INFO [train.py:996] (0/4) Epoch 4, batch 23250, loss[loss=0.219, simple_loss=0.2901, pruned_loss=0.07389, over 21960.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3062, pruned_loss=0.08598, over 4291153.22 frames. 
], batch size: 316, lr: 7.69e-03, grad_scale: 32.0 2023-06-21 01:45:07,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=688404.0, ans=0.0 2023-06-21 01:45:13,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=688404.0, ans=0.2 2023-06-21 01:45:50,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=688464.0, ans=0.025 2023-06-21 01:46:15,221 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-06-21 01:46:25,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=688584.0, ans=0.0 2023-06-21 01:46:46,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=688584.0, ans=0.125 2023-06-21 01:47:15,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=15.0 2023-06-21 01:47:16,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=688704.0, ans=0.0 2023-06-21 01:47:17,193 INFO [train.py:996] (0/4) Epoch 4, batch 23300, loss[loss=0.2859, simple_loss=0.3909, pruned_loss=0.09039, over 21764.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3149, pruned_loss=0.08776, over 4289273.77 frames. ], batch size: 351, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:47:18,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.20 vs. limit=10.0 2023-06-21 01:48:49,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=688884.0, ans=0.125 2023-06-21 01:48:50,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.610e+02 2.927e+02 3.361e+02 4.958e+02, threshold=5.855e+02, percent-clipped=0.0 2023-06-21 01:49:20,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=688944.0, ans=0.2 2023-06-21 01:49:30,644 INFO [train.py:996] (0/4) Epoch 4, batch 23350, loss[loss=0.2179, simple_loss=0.3051, pruned_loss=0.06538, over 20760.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.319, pruned_loss=0.08681, over 4275124.24 frames. ], batch size: 607, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:50:55,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=689244.0, ans=0.025 2023-06-21 01:51:04,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=689244.0, ans=0.2 2023-06-21 01:51:21,098 INFO [train.py:996] (0/4) Epoch 4, batch 23400, loss[loss=0.1822, simple_loss=0.2592, pruned_loss=0.05263, over 21405.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3121, pruned_loss=0.08288, over 4277426.87 frames. 
], batch size: 211, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:52:34,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=689424.0, ans=0.0 2023-06-21 01:52:58,687 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.326e+02 2.655e+02 3.143e+02 5.119e+02, threshold=5.310e+02, percent-clipped=0.0 2023-06-21 01:53:33,732 INFO [train.py:996] (0/4) Epoch 4, batch 23450, loss[loss=0.2654, simple_loss=0.3274, pruned_loss=0.1017, over 21513.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.312, pruned_loss=0.08486, over 4271196.70 frames. ], batch size: 194, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:54:18,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=689724.0, ans=0.1 2023-06-21 01:54:52,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=689784.0, ans=0.1 2023-06-21 01:55:27,870 INFO [train.py:996] (0/4) Epoch 4, batch 23500, loss[loss=0.223, simple_loss=0.2947, pruned_loss=0.07568, over 21886.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3132, pruned_loss=0.08706, over 4274258.14 frames. ], batch size: 332, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:55:49,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=689964.0, ans=0.0 2023-06-21 01:56:13,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=690024.0, ans=0.025 2023-06-21 01:56:16,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=690024.0, ans=0.125 2023-06-21 01:56:20,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=690084.0, ans=0.2 2023-06-21 01:56:20,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=690084.0, ans=0.125 2023-06-21 01:56:28,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.673e+02 3.044e+02 3.448e+02 4.722e+02, threshold=6.088e+02, percent-clipped=0.0 2023-06-21 01:56:30,067 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.09 vs. limit=15.0 2023-06-21 01:57:04,421 INFO [train.py:996] (0/4) Epoch 4, batch 23550, loss[loss=0.2164, simple_loss=0.2724, pruned_loss=0.08021, over 21273.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3079, pruned_loss=0.08655, over 4273339.15 frames. ], batch size: 159, lr: 7.68e-03, grad_scale: 32.0 2023-06-21 01:57:19,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=690264.0, ans=12.0 2023-06-21 01:59:09,019 INFO [train.py:996] (0/4) Epoch 4, batch 23600, loss[loss=0.2504, simple_loss=0.3212, pruned_loss=0.08976, over 21802.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3092, pruned_loss=0.0866, over 4265407.90 frames. 
], batch size: 282, lr: 7.67e-03, grad_scale: 32.0 2023-06-21 01:59:09,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=690504.0, ans=0.0 2023-06-21 01:59:16,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=690504.0, ans=0.0 2023-06-21 02:00:20,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-06-21 02:00:30,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=690684.0, ans=0.125 2023-06-21 02:00:35,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=690684.0, ans=0.125 2023-06-21 02:00:52,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.585e+02 3.175e+02 3.942e+02 6.182e+02, threshold=6.349e+02, percent-clipped=1.0 2023-06-21 02:01:08,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=690744.0, ans=0.05 2023-06-21 02:01:13,087 INFO [train.py:996] (0/4) Epoch 4, batch 23650, loss[loss=0.2205, simple_loss=0.2529, pruned_loss=0.09411, over 20068.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3086, pruned_loss=0.08475, over 4262884.45 frames. ], batch size: 704, lr: 7.67e-03, grad_scale: 32.0 2023-06-21 02:02:38,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-21 02:02:51,055 INFO [train.py:996] (0/4) Epoch 4, batch 23700, loss[loss=0.1984, simple_loss=0.2845, pruned_loss=0.05619, over 21761.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3119, pruned_loss=0.08395, over 4266628.70 frames. ], batch size: 332, lr: 7.67e-03, grad_scale: 32.0 2023-06-21 02:04:05,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=691284.0, ans=0.125 2023-06-21 02:04:09,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.529e+02 2.918e+02 3.500e+02 6.134e+02, threshold=5.836e+02, percent-clipped=0.0 2023-06-21 02:04:27,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=691344.0, ans=0.0 2023-06-21 02:04:28,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=691344.0, ans=0.125 2023-06-21 02:04:36,869 INFO [train.py:996] (0/4) Epoch 4, batch 23750, loss[loss=0.2218, simple_loss=0.3207, pruned_loss=0.06151, over 21648.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3152, pruned_loss=0.08541, over 4269402.54 frames. ], batch size: 389, lr: 7.67e-03, grad_scale: 16.0 2023-06-21 02:05:44,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=15.0 2023-06-21 02:05:48,122 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:05:53,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. 
limit=6.0 2023-06-21 02:06:03,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=691584.0, ans=0.125 2023-06-21 02:06:20,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=691644.0, ans=0.035 2023-06-21 02:06:39,763 INFO [train.py:996] (0/4) Epoch 4, batch 23800, loss[loss=0.2796, simple_loss=0.3892, pruned_loss=0.08502, over 20779.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3134, pruned_loss=0.08328, over 4268267.25 frames. ], batch size: 607, lr: 7.67e-03, grad_scale: 16.0 2023-06-21 02:07:42,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=691764.0, ans=0.1 2023-06-21 02:07:42,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=691764.0, ans=0.1 2023-06-21 02:07:59,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=691884.0, ans=0.125 2023-06-21 02:07:59,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=691884.0, ans=0.5 2023-06-21 02:08:13,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=691884.0, ans=0.2 2023-06-21 02:08:15,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=691884.0, ans=0.0 2023-06-21 02:08:17,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.570e+02 3.095e+02 3.507e+02 5.751e+02, threshold=6.189e+02, percent-clipped=0.0 2023-06-21 02:09:00,381 INFO [train.py:996] (0/4) Epoch 4, batch 23850, loss[loss=0.256, simple_loss=0.3247, pruned_loss=0.09363, over 21338.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3225, pruned_loss=0.08598, over 4268874.94 frames. ], batch size: 176, lr: 7.67e-03, grad_scale: 16.0 2023-06-21 02:09:14,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=692004.0, ans=0.0 2023-06-21 02:09:17,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=692004.0, ans=0.125 2023-06-21 02:10:12,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=692184.0, ans=0.125 2023-06-21 02:10:16,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-21 02:10:33,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=692244.0, ans=0.125 2023-06-21 02:10:33,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=692244.0, ans=0.125 2023-06-21 02:10:42,683 INFO [train.py:996] (0/4) Epoch 4, batch 23900, loss[loss=0.2404, simple_loss=0.3168, pruned_loss=0.08197, over 21818.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3308, pruned_loss=0.08862, over 4276262.08 frames. 
], batch size: 124, lr: 7.66e-03, grad_scale: 16.0 2023-06-21 02:11:35,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-21 02:12:06,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.492e+02 2.802e+02 3.240e+02 5.431e+02, threshold=5.603e+02, percent-clipped=0.0 2023-06-21 02:12:35,445 INFO [train.py:996] (0/4) Epoch 4, batch 23950, loss[loss=0.2437, simple_loss=0.3125, pruned_loss=0.08744, over 21298.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3233, pruned_loss=0.08816, over 4255124.20 frames. ], batch size: 176, lr: 7.66e-03, grad_scale: 16.0 2023-06-21 02:14:05,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=692784.0, ans=0.125 2023-06-21 02:14:43,752 INFO [train.py:996] (0/4) Epoch 4, batch 24000, loss[loss=0.2548, simple_loss=0.3262, pruned_loss=0.09171, over 20652.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3241, pruned_loss=0.09101, over 4255417.75 frames. ], batch size: 607, lr: 7.66e-03, grad_scale: 32.0 2023-06-21 02:14:43,754 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 02:15:40,417 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.268, simple_loss=0.3653, pruned_loss=0.08536, over 1796401.00 frames. 2023-06-21 02:15:40,418 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23701MB 2023-06-21 02:16:04,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=692964.0, ans=0.125 2023-06-21 02:16:20,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-21 02:16:56,187 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.654e+02 3.041e+02 3.636e+02 5.941e+02, threshold=6.083e+02, percent-clipped=2.0 2023-06-21 02:17:26,611 INFO [train.py:996] (0/4) Epoch 4, batch 24050, loss[loss=0.1943, simple_loss=0.2887, pruned_loss=0.05002, over 21647.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3267, pruned_loss=0.09182, over 4253480.31 frames. ], batch size: 263, lr: 7.66e-03, grad_scale: 32.0 2023-06-21 02:17:33,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.76 vs. limit=15.0 2023-06-21 02:18:05,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=693264.0, ans=0.125 2023-06-21 02:19:17,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=693444.0, ans=0.0 2023-06-21 02:19:17,228 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:19:19,533 INFO [train.py:996] (0/4) Epoch 4, batch 24100, loss[loss=0.2665, simple_loss=0.3468, pruned_loss=0.09304, over 21868.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3275, pruned_loss=0.09037, over 4263607.90 frames. 
], batch size: 371, lr: 7.66e-03, grad_scale: 32.0 2023-06-21 02:19:45,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=693504.0, ans=0.07 2023-06-21 02:20:41,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=693624.0, ans=0.0 2023-06-21 02:20:55,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=693684.0, ans=0.1 2023-06-21 02:21:09,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.445e+02 2.792e+02 3.280e+02 5.618e+02, threshold=5.584e+02, percent-clipped=0.0 2023-06-21 02:21:27,700 INFO [train.py:996] (0/4) Epoch 4, batch 24150, loss[loss=0.294, simple_loss=0.3571, pruned_loss=0.1155, over 21854.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3264, pruned_loss=0.09128, over 4269505.02 frames. ], batch size: 124, lr: 7.66e-03, grad_scale: 32.0 2023-06-21 02:21:45,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=693804.0, ans=0.125 2023-06-21 02:21:52,325 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-21 02:22:53,864 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:23:39,050 INFO [train.py:996] (0/4) Epoch 4, batch 24200, loss[loss=0.282, simple_loss=0.3673, pruned_loss=0.09835, over 21665.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3278, pruned_loss=0.09251, over 4275518.03 frames. ], batch size: 414, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:24:19,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=694224.0, ans=0.0 2023-06-21 02:24:31,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=694224.0, ans=10.0 2023-06-21 02:25:01,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=694284.0, ans=0.125 2023-06-21 02:25:12,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-21 02:25:14,757 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.339e+02 2.793e+02 3.596e+02 6.031e+02, threshold=5.587e+02, percent-clipped=1.0 2023-06-21 02:25:38,068 INFO [train.py:996] (0/4) Epoch 4, batch 24250, loss[loss=0.1893, simple_loss=0.2752, pruned_loss=0.05175, over 21178.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3245, pruned_loss=0.08558, over 4268842.64 frames. ], batch size: 159, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:25:53,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=12.0 2023-06-21 02:26:07,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=694464.0, ans=0.0 2023-06-21 02:27:16,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=694584.0, ans=0.1 2023-06-21 02:27:27,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=694584.0, ans=0.2 2023-06-21 02:28:01,386 INFO [train.py:996] (0/4) Epoch 4, batch 24300, loss[loss=0.2176, simple_loss=0.2946, pruned_loss=0.0703, over 21672.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3146, pruned_loss=0.07879, over 4268420.05 frames. ], batch size: 441, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:28:18,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0 2023-06-21 02:28:53,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.02 vs. limit=15.0 2023-06-21 02:29:19,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 2.168e+02 3.072e+02 4.276e+02 8.509e+02, threshold=6.143e+02, percent-clipped=10.0 2023-06-21 02:29:56,145 INFO [train.py:996] (0/4) Epoch 4, batch 24350, loss[loss=0.2521, simple_loss=0.3151, pruned_loss=0.09452, over 21340.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3103, pruned_loss=0.07863, over 4268011.41 frames. ], batch size: 176, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:31:31,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=695244.0, ans=0.2 2023-06-21 02:32:00,367 INFO [train.py:996] (0/4) Epoch 4, batch 24400, loss[loss=0.2805, simple_loss=0.3564, pruned_loss=0.1023, over 21459.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3165, pruned_loss=0.08377, over 4274170.71 frames. ], batch size: 131, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:32:00,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=695304.0, ans=0.125 2023-06-21 02:32:32,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=695364.0, ans=0.125 2023-06-21 02:32:32,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=695364.0, ans=0.0 2023-06-21 02:33:09,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=695484.0, ans=0.1 2023-06-21 02:33:26,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.702e+02 3.028e+02 3.520e+02 5.844e+02, threshold=6.057e+02, percent-clipped=0.0 2023-06-21 02:33:59,835 INFO [train.py:996] (0/4) Epoch 4, batch 24450, loss[loss=0.2118, simple_loss=0.2797, pruned_loss=0.07196, over 21846.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.321, pruned_loss=0.08537, over 4270103.39 frames. 
], batch size: 98, lr: 7.65e-03, grad_scale: 32.0 2023-06-21 02:34:03,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=695604.0, ans=0.125 2023-06-21 02:35:16,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-21 02:35:50,231 INFO [train.py:996] (0/4) Epoch 4, batch 24500, loss[loss=0.248, simple_loss=0.3101, pruned_loss=0.09301, over 21921.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3198, pruned_loss=0.0853, over 4274603.76 frames. ], batch size: 351, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:36:32,005 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-116000.pt 2023-06-21 02:36:39,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=695964.0, ans=0.07 2023-06-21 02:36:47,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=696024.0, ans=15.0 2023-06-21 02:36:59,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.51 vs. limit=15.0 2023-06-21 02:37:15,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=696084.0, ans=0.0 2023-06-21 02:37:18,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.487e+02 2.699e+02 3.128e+02 4.356e+02, threshold=5.399e+02, percent-clipped=0.0 2023-06-21 02:37:33,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.43 vs. limit=10.0 2023-06-21 02:37:33,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=696144.0, ans=0.125 2023-06-21 02:37:59,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=696144.0, ans=0.0 2023-06-21 02:38:03,314 INFO [train.py:996] (0/4) Epoch 4, batch 24550, loss[loss=0.2533, simple_loss=0.2889, pruned_loss=0.1089, over 20245.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3224, pruned_loss=0.08847, over 4268124.41 frames. ], batch size: 703, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:38:26,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-21 02:38:47,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=696324.0, ans=0.0 2023-06-21 02:38:53,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=12.0 2023-06-21 02:39:37,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=696444.0, ans=0.2 2023-06-21 02:39:57,895 INFO [train.py:996] (0/4) Epoch 4, batch 24600, loss[loss=0.2428, simple_loss=0.2855, pruned_loss=0.1001, over 21229.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3173, pruned_loss=0.08804, over 4263672.42 frames. 
], batch size: 548, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:40:12,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=696504.0, ans=0.125 2023-06-21 02:40:16,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-21 02:40:26,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=696564.0, ans=0.125 2023-06-21 02:40:30,993 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-21 02:41:15,581 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.636e+02 3.077e+02 3.640e+02 5.149e+02, threshold=6.153e+02, percent-clipped=0.0 2023-06-21 02:41:49,319 INFO [train.py:996] (0/4) Epoch 4, batch 24650, loss[loss=0.1924, simple_loss=0.2495, pruned_loss=0.06767, over 21513.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3089, pruned_loss=0.08613, over 4258284.52 frames. ], batch size: 196, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:42:07,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=696804.0, ans=0.125 2023-06-21 02:42:23,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=696864.0, ans=0.1 2023-06-21 02:42:31,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=696924.0, ans=0.0 2023-06-21 02:42:58,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=696924.0, ans=0.125 2023-06-21 02:43:01,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=696984.0, ans=0.1 2023-06-21 02:43:42,984 INFO [train.py:996] (0/4) Epoch 4, batch 24700, loss[loss=0.2416, simple_loss=0.3336, pruned_loss=0.07478, over 21681.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3067, pruned_loss=0.08371, over 4252810.74 frames. ], batch size: 414, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:44:06,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=697164.0, ans=0.125 2023-06-21 02:44:07,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=697164.0, ans=0.035 2023-06-21 02:45:05,998 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.595e+02 3.081e+02 3.646e+02 7.591e+02, threshold=6.163e+02, percent-clipped=1.0 2023-06-21 02:45:19,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=697344.0, ans=0.0 2023-06-21 02:45:26,107 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.78 vs. limit=15.0 2023-06-21 02:45:37,772 INFO [train.py:996] (0/4) Epoch 4, batch 24750, loss[loss=0.1904, simple_loss=0.2617, pruned_loss=0.05959, over 21363.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.2998, pruned_loss=0.08091, over 4253324.57 frames. 
], batch size: 131, lr: 7.64e-03, grad_scale: 32.0 2023-06-21 02:46:16,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=697464.0, ans=0.0 2023-06-21 02:46:36,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=697524.0, ans=0.125 2023-06-21 02:46:53,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=697584.0, ans=0.1 2023-06-21 02:47:18,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=697644.0, ans=0.125 2023-06-21 02:47:52,390 INFO [train.py:996] (0/4) Epoch 4, batch 24800, loss[loss=0.2499, simple_loss=0.2915, pruned_loss=0.1042, over 21433.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2945, pruned_loss=0.08021, over 4257856.14 frames. ], batch size: 473, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:48:09,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=697704.0, ans=0.125 2023-06-21 02:48:12,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.97 vs. limit=12.0 2023-06-21 02:48:15,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=697764.0, ans=0.025 2023-06-21 02:48:28,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=697824.0, ans=0.125 2023-06-21 02:48:39,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=697824.0, ans=0.125 2023-06-21 02:48:56,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=697884.0, ans=0.125 2023-06-21 02:49:02,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.444e+02 2.860e+02 3.241e+02 4.627e+02, threshold=5.721e+02, percent-clipped=0.0 2023-06-21 02:49:18,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=697944.0, ans=0.0 2023-06-21 02:49:29,179 INFO [train.py:996] (0/4) Epoch 4, batch 24850, loss[loss=0.3301, simple_loss=0.3835, pruned_loss=0.1383, over 21618.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2965, pruned_loss=0.08276, over 4273409.93 frames. ], batch size: 508, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:50:26,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=698124.0, ans=0.125 2023-06-21 02:50:28,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=698124.0, ans=0.125 2023-06-21 02:51:10,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=698244.0, ans=0.125 2023-06-21 02:51:15,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=698244.0, ans=0.0 2023-06-21 02:51:32,400 INFO [train.py:996] (0/4) Epoch 4, batch 24900, loss[loss=0.2754, simple_loss=0.3421, pruned_loss=0.1044, over 21880.00 frames. 
], tot_loss[loss=0.2361, simple_loss=0.3022, pruned_loss=0.085, over 4275931.37 frames. ], batch size: 371, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:51:49,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=698304.0, ans=22.5 2023-06-21 02:51:59,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=698364.0, ans=0.125 2023-06-21 02:52:01,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=698364.0, ans=0.125 2023-06-21 02:52:31,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=698424.0, ans=0.125 2023-06-21 02:52:48,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=698484.0, ans=0.125 2023-06-21 02:53:04,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=698484.0, ans=0.0 2023-06-21 02:53:10,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.670e+02 3.202e+02 3.948e+02 7.205e+02, threshold=6.404e+02, percent-clipped=5.0 2023-06-21 02:53:38,991 INFO [train.py:996] (0/4) Epoch 4, batch 24950, loss[loss=0.2209, simple_loss=0.2625, pruned_loss=0.08964, over 20345.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3104, pruned_loss=0.08934, over 4273411.15 frames. ], batch size: 703, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:54:19,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=698664.0, ans=0.1 2023-06-21 02:55:12,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.11 vs. limit=10.0 2023-06-21 02:55:32,505 INFO [train.py:996] (0/4) Epoch 4, batch 25000, loss[loss=0.2314, simple_loss=0.2957, pruned_loss=0.08355, over 21645.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3149, pruned_loss=0.09091, over 4273863.14 frames. ], batch size: 332, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:57:02,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.552e+02 2.956e+02 3.917e+02 9.098e+02, threshold=5.911e+02, percent-clipped=2.0 2023-06-21 02:57:16,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=699144.0, ans=0.125 2023-06-21 02:57:19,773 INFO [train.py:996] (0/4) Epoch 4, batch 25050, loss[loss=0.2152, simple_loss=0.279, pruned_loss=0.07571, over 21713.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3078, pruned_loss=0.08866, over 4275447.54 frames. ], batch size: 333, lr: 7.63e-03, grad_scale: 32.0 2023-06-21 02:58:15,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=699324.0, ans=0.1 2023-06-21 02:58:30,167 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.34 vs. 
limit=15.0 2023-06-21 02:58:42,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=699384.0, ans=0.0 2023-06-21 02:59:25,201 INFO [train.py:996] (0/4) Epoch 4, batch 25100, loss[loss=0.2243, simple_loss=0.2934, pruned_loss=0.07759, over 21263.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3011, pruned_loss=0.08724, over 4276106.53 frames. ], batch size: 176, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 02:59:28,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=699504.0, ans=0.125 2023-06-21 03:00:02,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=699564.0, ans=15.0 2023-06-21 03:00:29,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=699684.0, ans=0.07 2023-06-21 03:00:42,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=699684.0, ans=0.1 2023-06-21 03:00:46,755 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.444e+02 2.736e+02 3.224e+02 6.054e+02, threshold=5.473e+02, percent-clipped=1.0 2023-06-21 03:01:08,941 INFO [train.py:996] (0/4) Epoch 4, batch 25150, loss[loss=0.2444, simple_loss=0.3393, pruned_loss=0.07477, over 21674.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3052, pruned_loss=0.08532, over 4258372.01 frames. ], batch size: 414, lr: 7.62e-03, grad_scale: 32.0 2023-06-21 03:01:24,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=699864.0, ans=0.125 2023-06-21 03:01:29,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=699864.0, ans=0.125 2023-06-21 03:01:46,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-21 03:02:28,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=700044.0, ans=0.0 2023-06-21 03:02:44,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=12.0 2023-06-21 03:02:45,530 INFO [train.py:996] (0/4) Epoch 4, batch 25200, loss[loss=0.2437, simple_loss=0.333, pruned_loss=0.07724, over 21684.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3037, pruned_loss=0.08228, over 4254283.27 frames. 
2023-06-21 03:02:53,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=700104.0, ans=0.125
2023-06-21 03:02:58,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=700104.0, ans=0.2
2023-06-21 03:03:01,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=700164.0, ans=0.125
2023-06-21 03:03:17,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=700164.0, ans=0.125
2023-06-21 03:03:37,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=700224.0, ans=0.125
2023-06-21 03:03:57,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=700284.0, ans=0.015
2023-06-21 03:04:01,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 2.161e+02 2.477e+02 2.831e+02 4.094e+02, threshold=4.954e+02, percent-clipped=0.0
2023-06-21 03:04:20,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=700344.0, ans=0.125
2023-06-21 03:04:23,325 INFO [train.py:996] (0/4) Epoch 4, batch 25250, loss[loss=0.2157, simple_loss=0.2699, pruned_loss=0.08074, over 21271.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3023, pruned_loss=0.0814, over 4249975.00 frames. ], batch size: 144, lr: 7.62e-03, grad_scale: 32.0
2023-06-21 03:04:28,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=700404.0, ans=0.2
2023-06-21 03:04:34,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=700404.0, ans=0.2
2023-06-21 03:05:08,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=700524.0, ans=0.2
2023-06-21 03:06:01,003 INFO [train.py:996] (0/4) Epoch 4, batch 25300, loss[loss=0.2244, simple_loss=0.2998, pruned_loss=0.07451, over 21617.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2992, pruned_loss=0.08041, over 4259924.14 frames. ], batch size: 230, lr: 7.62e-03, grad_scale: 32.0
2023-06-21 03:06:10,289 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 03:06:46,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=700764.0, ans=0.2
2023-06-21 03:07:20,528 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.53 vs. limit=10.0
2023-06-21 03:07:35,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.516e+02 3.011e+02 3.707e+02 6.335e+02, threshold=6.023e+02, percent-clipped=10.0
2023-06-21 03:07:54,503 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.07 vs. limit=22.5
2023-06-21 03:08:04,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.37 vs. limit=22.5
2023-06-21 03:08:04,599 INFO [train.py:996] (0/4) Epoch 4, batch 25350, loss[loss=0.2158, simple_loss=0.2921, pruned_loss=0.06971, over 21766.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3031, pruned_loss=0.0808, over 4253146.28 frames. ], batch size: 371, lr: 7.62e-03, grad_scale: 32.0
2023-06-21 03:08:12,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=701004.0, ans=0.1
2023-06-21 03:08:20,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=701004.0, ans=0.0
2023-06-21 03:08:34,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=701064.0, ans=0.0
2023-06-21 03:08:37,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=701064.0, ans=0.125
2023-06-21 03:08:54,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=701124.0, ans=0.015
2023-06-21 03:08:54,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=701124.0, ans=0.2
2023-06-21 03:09:03,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=701124.0, ans=0.125
2023-06-21 03:09:13,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0
2023-06-21 03:09:46,632 INFO [train.py:996] (0/4) Epoch 4, batch 25400, loss[loss=0.209, simple_loss=0.3013, pruned_loss=0.05836, over 19853.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2995, pruned_loss=0.07891, over 4249448.85 frames. ], batch size: 702, lr: 7.62e-03, grad_scale: 32.0
2023-06-21 03:09:58,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=701304.0, ans=0.125
2023-06-21 03:10:58,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0
2023-06-21 03:11:13,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.377e+02 2.648e+02 3.077e+02 4.908e+02, threshold=5.297e+02, percent-clipped=0.0
2023-06-21 03:11:30,291 INFO [train.py:996] (0/4) Epoch 4, batch 25450, loss[loss=0.1995, simple_loss=0.2922, pruned_loss=0.05343, over 21559.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2997, pruned_loss=0.07944, over 4248969.54 frames. ], batch size: 230, lr: 7.61e-03, grad_scale: 32.0
2023-06-21 03:11:39,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=701604.0, ans=0.125
2023-06-21 03:11:41,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=701604.0, ans=0.035
2023-06-21 03:12:33,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=701724.0, ans=0.125
2023-06-21 03:12:46,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=701784.0, ans=0.2
2023-06-21 03:12:54,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=701784.0, ans=0.0
2023-06-21 03:12:54,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=701784.0, ans=0.0
2023-06-21 03:13:31,376 INFO [train.py:996] (0/4) Epoch 4, batch 25500, loss[loss=0.2959, simple_loss=0.3715, pruned_loss=0.1101, over 21454.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3011, pruned_loss=0.0775, over 4253802.34 frames. ], batch size: 507, lr: 7.61e-03, grad_scale: 32.0
2023-06-21 03:13:33,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=701904.0, ans=0.0
2023-06-21 03:14:08,479 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0
2023-06-21 03:14:22,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=701964.0, ans=0.125
2023-06-21 03:15:04,301 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.421e+02 2.742e+02 3.267e+02 5.395e+02, threshold=5.484e+02, percent-clipped=1.0
2023-06-21 03:15:18,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=702144.0, ans=0.125
2023-06-21 03:15:41,948 INFO [train.py:996] (0/4) Epoch 4, batch 25550, loss[loss=0.2614, simple_loss=0.3616, pruned_loss=0.08064, over 21585.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3078, pruned_loss=0.07813, over 4252943.90 frames. ], batch size: 471, lr: 7.61e-03, grad_scale: 16.0
2023-06-21 03:16:32,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=702264.0, ans=0.0
2023-06-21 03:16:57,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=702384.0, ans=0.125
2023-06-21 03:17:03,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=702384.0, ans=0.0
2023-06-21 03:17:11,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=702384.0, ans=0.1
2023-06-21 03:17:35,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=702504.0, ans=0.2
2023-06-21 03:17:36,135 INFO [train.py:996] (0/4) Epoch 4, batch 25600, loss[loss=0.2504, simple_loss=0.3263, pruned_loss=0.08722, over 21399.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3128, pruned_loss=0.0794, over 4264244.19 frames. ], batch size: 211, lr: 7.61e-03, grad_scale: 32.0
2023-06-21 03:18:25,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=702564.0, ans=0.125
2023-06-21 03:18:33,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=702624.0, ans=0.0
2023-06-21 03:18:36,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=702624.0, ans=0.125
2023-06-21 03:18:36,678 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0
2023-06-21 03:18:50,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.26 vs. limit=15.0
2023-06-21 03:18:55,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=702684.0, ans=0.2
2023-06-21 03:19:11,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=702684.0, ans=0.1
2023-06-21 03:19:16,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.481e+02 2.894e+02 3.647e+02 7.641e+02, threshold=5.788e+02, percent-clipped=5.0
2023-06-21 03:19:31,805 INFO [train.py:996] (0/4) Epoch 4, batch 25650, loss[loss=0.2345, simple_loss=0.2982, pruned_loss=0.08537, over 21886.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3137, pruned_loss=0.08216, over 4258816.37 frames. ], batch size: 107, lr: 7.61e-03, grad_scale: 32.0
2023-06-21 03:19:47,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=702804.0, ans=0.0
2023-06-21 03:20:08,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=702864.0, ans=0.125
2023-06-21 03:20:30,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=702924.0, ans=0.0
2023-06-21 03:20:54,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.19 vs. limit=8.0
2023-06-21 03:21:22,354 INFO [train.py:996] (0/4) Epoch 4, batch 25700, loss[loss=0.2375, simple_loss=0.3069, pruned_loss=0.08404, over 21875.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3114, pruned_loss=0.0827, over 4248182.59 frames. ], batch size: 118, lr: 7.61e-03, grad_scale: 32.0
2023-06-21 03:22:01,972 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0
2023-06-21 03:22:29,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0
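The scaling.py:962 Whitening entries above compare a per-module whiteness metric against a limit. One plausible form of such a metric (a sketch of the general idea, not necessarily the exact formula behind these logs) measures how far the feature covariance is from isotropic: the ratio of the mean squared eigenvalue to the squared mean eigenvalue, which is 1.0 for perfectly white features and grows as variance concentrates in a few directions:

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels). Whiteness of the feature covariance,
    mean(eig^2) / mean(eig)^2, computed via traces so no eigendecomposition
    is needed. 1.0 <=> perfectly white (isotropic); larger values <=>
    variance concentrated in a few directions."""
    x = x - x.mean(dim=0)                            # zero-mean each channel
    cov = (x.t() @ x) / x.shape[0]                   # (C, C) covariance estimate
    d = cov.shape[0]
    mean_eig_sq = (cov @ cov).diagonal().sum() / d   # trace(cov^2) / d
    mean_eig = cov.diagonal().sum() / d              # trace(cov) / d
    return (mean_eig_sq / mean_eig ** 2).item()

print(whitening_metric(torch.randn(10000, 256)))     # close to 1 for white noise
```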
2023-06-21 03:22:36,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=703284.0, ans=0.0
2023-06-21 03:22:43,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.692e+02 3.025e+02 3.444e+02 5.063e+02, threshold=6.050e+02, percent-clipped=0.0
2023-06-21 03:23:08,322 INFO [train.py:996] (0/4) Epoch 4, batch 25750, loss[loss=0.2609, simple_loss=0.3315, pruned_loss=0.09512, over 21615.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3158, pruned_loss=0.08536, over 4256644.40 frames. ], batch size: 389, lr: 7.60e-03, grad_scale: 32.0
2023-06-21 03:24:30,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=703524.0, ans=0.125
2023-06-21 03:24:51,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=703584.0, ans=0.125
2023-06-21 03:25:07,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=703584.0, ans=0.0
2023-06-21 03:25:44,703 INFO [train.py:996] (0/4) Epoch 4, batch 25800, loss[loss=0.3124, simple_loss=0.3936, pruned_loss=0.1156, over 21817.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3264, pruned_loss=0.09013, over 4259496.30 frames. ], batch size: 118, lr: 7.60e-03, grad_scale: 32.0
2023-06-21 03:26:21,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=703764.0, ans=0.125
2023-06-21 03:26:22,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=703764.0, ans=0.1
2023-06-21 03:27:03,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=703824.0, ans=0.125
2023-06-21 03:27:33,234 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.719e+02 3.110e+02 3.607e+02 5.723e+02, threshold=6.221e+02, percent-clipped=0.0
2023-06-21 03:27:53,795 INFO [train.py:996] (0/4) Epoch 4, batch 25850, loss[loss=0.2235, simple_loss=0.2889, pruned_loss=0.07906, over 21439.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3303, pruned_loss=0.09013, over 4268889.97 frames. ], batch size: 177, lr: 7.60e-03, grad_scale: 32.0
2023-06-21 03:28:39,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=704124.0, ans=0.125
2023-06-21 03:29:05,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0
2023-06-21 03:29:59,641 INFO [train.py:996] (0/4) Epoch 4, batch 25900, loss[loss=0.3339, simple_loss=0.4152, pruned_loss=0.1263, over 21657.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3311, pruned_loss=0.09027, over 4271275.36 frames. ], batch size: 441, lr: 7.60e-03, grad_scale: 32.0
2023-06-21 03:31:17,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.58 vs. limit=15.0
2023-06-21 03:31:32,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=704484.0, ans=0.125
2023-06-21 03:31:50,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=704544.0, ans=0.2
2023-06-21 03:31:51,619 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.727e+02 3.001e+02 3.716e+02 5.320e+02, threshold=6.003e+02, percent-clipped=0.0
2023-06-21 03:32:06,564 INFO [train.py:996] (0/4) Epoch 4, batch 25950, loss[loss=0.226, simple_loss=0.2688, pruned_loss=0.09157, over 20397.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3351, pruned_loss=0.09291, over 4268116.39 frames. ], batch size: 702, lr: 7.60e-03, grad_scale: 32.0
2023-06-21 03:33:08,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=704664.0, ans=0.125
2023-06-21 03:33:17,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0
2023-06-21 03:33:33,293 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0
2023-06-21 03:33:34,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.00 vs. limit=10.0
2023-06-21 03:33:46,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=704844.0, ans=0.0
2023-06-21 03:33:47,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=704844.0, ans=0.125
2023-06-21 03:33:57,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=704844.0, ans=0.1
2023-06-21 03:34:00,048 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0
2023-06-21 03:34:13,436 INFO [train.py:996] (0/4) Epoch 4, batch 26000, loss[loss=0.2764, simple_loss=0.3509, pruned_loss=0.101, over 21965.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3349, pruned_loss=0.09137, over 4271576.54 frames. ], batch size: 372, lr: 7.60e-03, grad_scale: 32.0
2023-06-21 03:34:15,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=704904.0, ans=10.0
2023-06-21 03:34:19,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=704904.0, ans=0.125
2023-06-21 03:34:32,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=704904.0, ans=10.0
2023-06-21 03:34:52,613 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0
2023-06-21 03:34:54,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=704964.0, ans=0.1
2023-06-21 03:35:02,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=705024.0, ans=0.125
2023-06-21 03:35:10,476 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=15.0
2023-06-21 03:35:45,258 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.487e+02 2.984e+02 3.738e+02 5.035e+02, threshold=5.968e+02, percent-clipped=0.0
2023-06-21 03:36:20,346 INFO [train.py:996] (0/4) Epoch 4, batch 26050, loss[loss=0.2394, simple_loss=0.3012, pruned_loss=0.08881, over 21717.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3347, pruned_loss=0.09097, over 4269131.29 frames. ], batch size: 230, lr: 7.59e-03, grad_scale: 32.0
2023-06-21 03:36:52,294 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.79 vs. limit=15.0
2023-06-21 03:36:54,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=705264.0, ans=0.1
2023-06-21 03:37:25,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=705384.0, ans=0.0
2023-06-21 03:37:57,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=705444.0, ans=0.125
2023-06-21 03:38:12,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=705504.0, ans=0.125
2023-06-21 03:38:13,463 INFO [train.py:996] (0/4) Epoch 4, batch 26100, loss[loss=0.2541, simple_loss=0.3273, pruned_loss=0.09042, over 21877.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.329, pruned_loss=0.09104, over 4281921.69 frames. ], batch size: 107, lr: 7.59e-03, grad_scale: 32.0
2023-06-21 03:38:15,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=705504.0, ans=0.125
2023-06-21 03:39:06,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=705564.0, ans=0.125
2023-06-21 03:39:24,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=705624.0, ans=0.1
2023-06-21 03:39:45,699 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.666e+02 3.098e+02 3.751e+02 6.264e+02, threshold=6.196e+02, percent-clipped=2.0
2023-06-21 03:40:00,817 INFO [train.py:996] (0/4) Epoch 4, batch 26150, loss[loss=0.2385, simple_loss=0.3107, pruned_loss=0.08317, over 21811.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3255, pruned_loss=0.09145, over 4287390.29 frames. ], batch size: 282, lr: 7.59e-03, grad_scale: 32.0
2023-06-21 03:40:22,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=705804.0, ans=0.125
2023-06-21 03:40:29,324 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5
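The optim.py:471 entries report gradient-norm quartiles for recent batches along with a clipping threshold and the fraction of batches clipped. As a rough sketch of the idea only (the actual optimizer logic in this codebase is more involved; the `factor` and the choice of statistic below are assumptions), a clipping threshold can be derived from a robust statistic of recent global gradient norms, e.g. a multiple of the median:

```python
import torch

def clip_by_recent_median(params, recent_norms, factor=2.0):
    """Clip this step's global grad norm to factor x median of recent norms.
    Sketch of the idea only; `factor` and the exact statistic are assumptions."""
    norms = [p.grad.norm() for p in params if p.grad is not None]
    grad_norm = torch.norm(torch.stack(norms))          # global grad norm
    recent_norms.append(grad_norm.item())
    hist = torch.tensor(recent_norms)
    quartiles = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = factor * hist.median()                  # robust to outlier batches
    clipped = grad_norm > threshold
    if clipped:
        for p in params:
            if p.grad is not None:
                p.grad.mul_(threshold / grad_norm)      # rescale grads in place
    return quartiles, threshold, bool(clipped)          # cf. the logged fields

# toy usage
w = torch.nn.Parameter(torch.randn(4))
w.grad = torch.randn(4)
print(clip_by_recent_median([w], recent_norms=[1.0, 2.0, 3.0]))
```

A median-based threshold explains why a single huge batch (the max quartile) can exceed the threshold while percent-clipped stays low, as in several of the entries here.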
2023-06-21 03:41:18,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=705924.0, ans=0.04949747468305833
2023-06-21 03:42:31,088 INFO [train.py:996] (0/4) Epoch 4, batch 26200, loss[loss=0.2768, simple_loss=0.3671, pruned_loss=0.09322, over 21652.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3264, pruned_loss=0.08993, over 4286883.85 frames. ], batch size: 389, lr: 7.59e-03, grad_scale: 16.0
2023-06-21 03:42:55,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=706164.0, ans=0.0
2023-06-21 03:43:09,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=706224.0, ans=0.1
2023-06-21 03:43:59,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=706284.0, ans=0.1
2023-06-21 03:44:04,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.476e+02 2.776e+02 3.346e+02 6.662e+02, threshold=5.552e+02, percent-clipped=1.0
2023-06-21 03:44:38,059 INFO [train.py:996] (0/4) Epoch 4, batch 26250, loss[loss=0.267, simple_loss=0.3403, pruned_loss=0.09692, over 21889.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3306, pruned_loss=0.08827, over 4285148.59 frames. ], batch size: 124, lr: 7.59e-03, grad_scale: 16.0
2023-06-21 03:44:42,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=706404.0, ans=0.1
2023-06-21 03:45:54,655 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 03:46:35,588 INFO [train.py:996] (0/4) Epoch 4, batch 26300, loss[loss=0.227, simple_loss=0.2979, pruned_loss=0.07811, over 21654.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3269, pruned_loss=0.08869, over 4292096.80 frames. ], batch size: 263, lr: 7.59e-03, grad_scale: 16.0
2023-06-21 03:47:43,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=706824.0, ans=0.1
2023-06-21 03:48:39,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.559e+02 2.845e+02 3.128e+02 5.313e+02, threshold=5.690e+02, percent-clipped=0.0
2023-06-21 03:48:53,022 INFO [train.py:996] (0/4) Epoch 4, batch 26350, loss[loss=0.2625, simple_loss=0.3307, pruned_loss=0.09718, over 21862.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3246, pruned_loss=0.08918, over 4291814.01 frames. ], batch size: 371, lr: 7.58e-03, grad_scale: 16.0
2023-06-21 03:49:07,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=707004.0, ans=0.1
2023-06-21 03:49:27,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=707064.0, ans=0.1
2023-06-21 03:50:46,308 INFO [train.py:996] (0/4) Epoch 4, batch 26400, loss[loss=0.2611, simple_loss=0.2955, pruned_loss=0.1134, over 21455.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3182, pruned_loss=0.08911, over 4292572.69 frames. ], batch size: 510, lr: 7.58e-03, grad_scale: 32.0
2023-06-21 03:50:49,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=707304.0, ans=0.1
2023-06-21 03:51:08,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.75 vs. limit=22.5
2023-06-21 03:51:09,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=707364.0, ans=0.0
2023-06-21 03:52:11,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 2.879e+02 3.292e+02 3.769e+02 5.955e+02, threshold=6.584e+02, percent-clipped=1.0
2023-06-21 03:52:33,057 INFO [train.py:996] (0/4) Epoch 4, batch 26450, loss[loss=0.2311, simple_loss=0.2897, pruned_loss=0.08629, over 21267.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3164, pruned_loss=0.08821, over 4286860.34 frames. ], batch size: 159, lr: 7.58e-03, grad_scale: 32.0
2023-06-21 03:52:53,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=707604.0, ans=0.125
2023-06-21 03:53:07,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=707604.0, ans=0.125
2023-06-21 03:53:36,067 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 03:54:17,902 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 03:54:49,755 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0
2023-06-21 03:54:54,307 INFO [train.py:996] (0/4) Epoch 4, batch 26500, loss[loss=0.2207, simple_loss=0.2964, pruned_loss=0.07255, over 21657.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3205, pruned_loss=0.0879, over 4282432.71 frames. ], batch size: 263, lr: 7.58e-03, grad_scale: 16.0
2023-06-21 03:55:34,955 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.72 vs. limit=15.0
2023-06-21 03:56:18,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=708024.0, ans=0.0
2023-06-21 03:56:27,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=708084.0, ans=0.125
2023-06-21 03:56:28,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=708084.0, ans=0.0
2023-06-21 03:56:40,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=708084.0, ans=0.125
2023-06-21 03:56:51,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=708144.0, ans=0.125
2023-06-21 03:56:53,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.479e+02 2.895e+02 3.453e+02 8.083e+02, threshold=5.789e+02, percent-clipped=2.0
2023-06-21 03:57:24,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=708204.0, ans=0.125
2023-06-21 03:57:25,083 INFO [train.py:996] (0/4) Epoch 4, batch 26550, loss[loss=0.2377, simple_loss=0.3282, pruned_loss=0.07362, over 21705.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3176, pruned_loss=0.08485, over 4281388.78 frames. ], batch size: 391, lr: 7.58e-03, grad_scale: 16.0
2023-06-21 03:57:30,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=708204.0, ans=0.125
2023-06-21 03:58:21,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5
2023-06-21 03:58:34,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=708324.0, ans=0.1
2023-06-21 03:58:59,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=708384.0, ans=0.1
2023-06-21 03:59:37,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=708444.0, ans=0.2
2023-06-21 03:59:43,289 INFO [train.py:996] (0/4) Epoch 4, batch 26600, loss[loss=0.2371, simple_loss=0.3175, pruned_loss=0.07839, over 21724.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3169, pruned_loss=0.08222, over 4274493.89 frames. ], batch size: 351, lr: 7.58e-03, grad_scale: 16.0
2023-06-21 04:00:28,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=708624.0, ans=0.2
2023-06-21 04:00:58,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=708684.0, ans=0.125
2023-06-21 04:01:18,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=708744.0, ans=0.95
2023-06-21 04:01:27,989 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.406e+02 2.882e+02 3.356e+02 5.632e+02, threshold=5.763e+02, percent-clipped=0.0
2023-06-21 04:01:39,518 INFO [train.py:996] (0/4) Epoch 4, batch 26650, loss[loss=0.1756, simple_loss=0.2612, pruned_loss=0.04501, over 21662.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3087, pruned_loss=0.08056, over 4250165.70 frames. ], batch size: 415, lr: 7.57e-03, grad_scale: 16.0
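The grad_scale value in the train.py:996 summaries moves between 32.0 and 16.0 through this stretch, which is the signature of dynamic loss scaling in fp16 training: the scale is halved when an overflow is detected and grown back after a run of clean steps. A minimal sketch using torch.cuda.amp.GradScaler, whose documented behavior matches this pattern (the exact wiring inside this training script is an assumption; the toy model below is hypothetical and needs a GPU):

```python
import torch

model = torch.nn.Linear(10, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

for step in range(4):
    opt.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 10, device="cuda")).pow(2).mean()
    scaler.scale(loss).backward()  # scale loss so fp16 grads stay representable
    scaler.step(opt)               # unscales grads; skips the step on inf/nan
    scaler.update()                # halves the scale on overflow, grows it after
                                   # `growth_interval` clean steps
    print(step, scaler.get_scale())  # e.g. 32.0 -> 16.0 after an overflow
```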
2023-06-21 04:01:46,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=708804.0, ans=0.125
2023-06-21 04:01:53,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=708804.0, ans=0.0
2023-06-21 04:02:04,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=708864.0, ans=0.125
2023-06-21 04:02:34,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=708924.0, ans=0.125
2023-06-21 04:02:35,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=708924.0, ans=0.2
2023-06-21 04:02:58,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0
2023-06-21 04:03:11,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=709044.0, ans=0.125
2023-06-21 04:03:15,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.53 vs. limit=15.0
2023-06-21 04:03:30,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=709044.0, ans=0.125
2023-06-21 04:03:38,413 INFO [train.py:996] (0/4) Epoch 4, batch 26700, loss[loss=0.2984, simple_loss=0.3376, pruned_loss=0.1296, over 21811.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3027, pruned_loss=0.07819, over 4256300.17 frames. ], batch size: 508, lr: 7.57e-03, grad_scale: 16.0
2023-06-21 04:03:50,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=709104.0, ans=0.0
2023-06-21 04:04:21,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0
2023-06-21 04:05:03,602 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 2.086e+02 2.352e+02 2.691e+02 3.815e+02, threshold=4.705e+02, percent-clipped=0.0
2023-06-21 04:05:19,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=709404.0, ans=0.04949747468305833
2023-06-21 04:05:20,969 INFO [train.py:996] (0/4) Epoch 4, batch 26750, loss[loss=0.2755, simple_loss=0.3511, pruned_loss=0.09998, over 21559.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3034, pruned_loss=0.07777, over 4264754.36 frames. ], batch size: 131, lr: 7.57e-03, grad_scale: 16.0
2023-06-21 04:05:57,708 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:06:00,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=709524.0, ans=0.125
2023-06-21 04:06:10,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=709584.0, ans=0.0
2023-06-21 04:06:19,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=709584.0, ans=0.0
2023-06-21 04:06:21,189 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:06:39,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0
2023-06-21 04:06:43,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=709644.0, ans=0.1
2023-06-21 04:07:04,519 INFO [train.py:996] (0/4) Epoch 4, batch 26800, loss[loss=0.2761, simple_loss=0.3499, pruned_loss=0.1012, over 21457.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3117, pruned_loss=0.08299, over 4265035.61 frames. ], batch size: 131, lr: 7.57e-03, grad_scale: 32.0
2023-06-21 04:07:15,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=709704.0, ans=0.0
2023-06-21 04:07:15,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=709704.0, ans=0.125
2023-06-21 04:07:34,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=709824.0, ans=0.125
2023-06-21 04:08:11,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=709884.0, ans=0.125
2023-06-21 04:08:14,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=709884.0, ans=15.0
2023-06-21 04:08:19,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.660e+02 3.035e+02 3.389e+02 6.268e+02, threshold=6.069e+02, percent-clipped=8.0
2023-06-21 04:08:19,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=709944.0, ans=0.0
2023-06-21 04:08:36,510 INFO [train.py:996] (0/4) Epoch 4, batch 26850, loss[loss=0.2206, simple_loss=0.2755, pruned_loss=0.08286, over 21254.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3128, pruned_loss=0.08526, over 4265765.97 frames. ], batch size: 159, lr: 7.57e-03, grad_scale: 32.0
2023-06-21 04:09:47,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5
2023-06-21 04:09:51,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=710244.0, ans=0.1
2023-06-21 04:09:54,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=710244.0, ans=0.1
2023-06-21 04:09:58,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=710244.0, ans=0.125
2023-06-21 04:10:09,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=710244.0, ans=0.1
2023-06-21 04:10:12,270 INFO [train.py:996] (0/4) Epoch 4, batch 26900, loss[loss=0.2124, simple_loss=0.2728, pruned_loss=0.07598, over 21637.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3044, pruned_loss=0.08439, over 4265296.58 frames. ], batch size: 333, lr: 7.57e-03, grad_scale: 32.0
2023-06-21 04:10:13,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0
2023-06-21 04:10:28,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=710364.0, ans=0.0
2023-06-21 04:10:30,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=710364.0, ans=0.1
2023-06-21 04:11:26,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=710544.0, ans=0.0
2023-06-21 04:11:29,918 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.421e+02 2.683e+02 3.097e+02 4.785e+02, threshold=5.366e+02, percent-clipped=0.0
2023-06-21 04:11:47,564 INFO [train.py:996] (0/4) Epoch 4, batch 26950, loss[loss=0.2737, simple_loss=0.3633, pruned_loss=0.09204, over 21624.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3029, pruned_loss=0.0837, over 4267471.42 frames. ], batch size: 389, lr: 7.57e-03, grad_scale: 32.0
2023-06-21 04:11:51,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=710604.0, ans=0.1
2023-06-21 04:12:05,907 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:13:17,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=710844.0, ans=0.1
2023-06-21 04:13:24,352 INFO [train.py:996] (0/4) Epoch 4, batch 27000, loss[loss=0.2093, simple_loss=0.2994, pruned_loss=0.05957, over 21751.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3042, pruned_loss=0.08153, over 4268081.58 frames. ], batch size: 282, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:13:24,353 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-21 04:14:23,370 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2574, simple_loss=0.3499, pruned_loss=0.08242, over 1796401.00 frames.
2023-06-21 04:14:23,372 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB
2023-06-21 04:14:27,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=710904.0, ans=0.0
2023-06-21 04:15:04,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=710964.0, ans=0.125
2023-06-21 04:15:40,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=711084.0, ans=0.07
2023-06-21 04:15:48,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.638e+02 2.341e+02 2.653e+02 3.174e+02 5.780e+02, threshold=5.306e+02, percent-clipped=1.0
2023-06-21 04:15:48,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=711144.0, ans=0.0
2023-06-21 04:15:59,801 INFO [train.py:996] (0/4) Epoch 4, batch 27050, loss[loss=0.2023, simple_loss=0.2961, pruned_loss=0.05421, over 21398.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3064, pruned_loss=0.07819, over 4268694.80 frames. ], batch size: 211, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:16:24,658 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:16:59,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=711384.0, ans=0.125
2023-06-21 04:17:10,560 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.79 vs. limit=22.5
2023-06-21 04:17:36,459 INFO [train.py:996] (0/4) Epoch 4, batch 27100, loss[loss=0.2242, simple_loss=0.3092, pruned_loss=0.06961, over 21456.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3076, pruned_loss=0.07926, over 4276324.49 frames. ], batch size: 131, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:17:39,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=711504.0, ans=0.2
2023-06-21 04:17:48,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=711504.0, ans=0.0
2023-06-21 04:17:51,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=711504.0, ans=0.125
2023-06-21 04:18:28,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=711564.0, ans=0.0
2023-06-21 04:18:54,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=711684.0, ans=0.5
2023-06-21 04:19:12,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=711744.0, ans=0.125
2023-06-21 04:19:13,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.531e+02 3.029e+02 3.575e+02 6.566e+02, threshold=6.059e+02, percent-clipped=4.0
2023-06-21 04:19:21,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0
2023-06-21 04:19:25,872 INFO [train.py:996] (0/4) Epoch 4, batch 27150, loss[loss=0.2362, simple_loss=0.3171, pruned_loss=0.07759, over 21145.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3195, pruned_loss=0.08361, over 4285506.92 frames. ], batch size: 143, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:20:04,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0
2023-06-21 04:20:06,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0
2023-06-21 04:20:26,985 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0
2023-06-21 04:20:32,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=711984.0, ans=0.1
2023-06-21 04:20:37,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=711984.0, ans=0.5
2023-06-21 04:20:40,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=711984.0, ans=0.1
2023-06-21 04:20:52,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=712044.0, ans=0.2
2023-06-21 04:20:58,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=712044.0, ans=0.2
2023-06-21 04:21:13,179 INFO [train.py:996] (0/4) Epoch 4, batch 27200, loss[loss=0.2527, simple_loss=0.3124, pruned_loss=0.09652, over 20013.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3276, pruned_loss=0.08658, over 4277167.72 frames. ], batch size: 703, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:21:56,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=712224.0, ans=0.0
2023-06-21 04:21:56,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=712224.0, ans=0.2
2023-06-21 04:22:37,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=712284.0, ans=0.125
2023-06-21 04:23:02,070 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.916e+02 3.596e+02 4.478e+02 6.197e+02, threshold=7.191e+02, percent-clipped=3.0
2023-06-21 04:23:11,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=712344.0, ans=0.2
2023-06-21 04:23:20,365 INFO [train.py:996] (0/4) Epoch 4, batch 27250, loss[loss=0.2705, simple_loss=0.3334, pruned_loss=0.1038, over 21948.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3302, pruned_loss=0.09061, over 4277228.31 frames. ], batch size: 372, lr: 7.56e-03, grad_scale: 32.0
2023-06-21 04:23:21,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0
2023-06-21 04:23:25,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=712404.0, ans=0.1
2023-06-21 04:24:49,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=712644.0, ans=0.2
2023-06-21 04:25:01,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=712644.0, ans=0.0
2023-06-21 04:25:04,920 INFO [train.py:996] (0/4) Epoch 4, batch 27300, loss[loss=0.2682, simple_loss=0.3521, pruned_loss=0.09211, over 21793.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3323, pruned_loss=0.09179, over 4268531.27 frames. ], batch size: 124, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:25:07,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0
2023-06-21 04:25:15,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=712704.0, ans=0.0
2023-06-21 04:25:26,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=712764.0, ans=0.125
2023-06-21 04:26:02,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=12.0
2023-06-21 04:26:50,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.547e+02 2.887e+02 3.256e+02 5.962e+02, threshold=5.774e+02, percent-clipped=0.0
2023-06-21 04:27:01,893 INFO [train.py:996] (0/4) Epoch 4, batch 27350, loss[loss=0.281, simple_loss=0.3559, pruned_loss=0.1031, over 21816.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3354, pruned_loss=0.09246, over 4272946.53 frames. ], batch size: 118, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:28:05,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=713124.0, ans=0.1
2023-06-21 04:28:07,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=713124.0, ans=0.07
2023-06-21 04:28:26,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=713124.0, ans=0.1
2023-06-21 04:28:53,503 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0
2023-06-21 04:28:55,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=713244.0, ans=0.1
2023-06-21 04:28:56,386 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0
2023-06-21 04:29:05,755 INFO [train.py:996] (0/4) Epoch 4, batch 27400, loss[loss=0.2286, simple_loss=0.2919, pruned_loss=0.08268, over 21770.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3306, pruned_loss=0.09174, over 4264717.04 frames. ], batch size: 316, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:29:14,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0
2023-06-21 04:29:46,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=713424.0, ans=0.125
2023-06-21 04:30:11,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=713484.0, ans=0.1
2023-06-21 04:30:42,125 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.364e+02 2.653e+02 3.007e+02 3.565e+02, threshold=5.305e+02, percent-clipped=0.0
2023-06-21 04:30:46,212 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5
2023-06-21 04:30:46,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=713544.0, ans=0.0
2023-06-21 04:31:01,229 INFO [train.py:996] (0/4) Epoch 4, batch 27450, loss[loss=0.2247, simple_loss=0.3087, pruned_loss=0.07041, over 21639.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3237, pruned_loss=0.08978, over 4261472.51 frames. ], batch size: 298, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:32:07,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=713724.0, ans=0.125
2023-06-21 04:32:46,518 INFO [train.py:996] (0/4) Epoch 4, batch 27500, loss[loss=0.2569, simple_loss=0.3256, pruned_loss=0.09413, over 21886.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3228, pruned_loss=0.0904, over 4261932.82 frames. ], batch size: 124, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:32:51,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=713904.0, ans=0.0
2023-06-21 04:32:51,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0
2023-06-21 04:33:45,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=714024.0, ans=0.1
2023-06-21 04:34:15,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.580e+02 3.045e+02 3.674e+02 6.292e+02, threshold=6.090e+02, percent-clipped=1.0
2023-06-21 04:34:27,551 INFO [train.py:996] (0/4) Epoch 4, batch 27550, loss[loss=0.2116, simple_loss=0.2784, pruned_loss=0.07242, over 21773.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3183, pruned_loss=0.08698, over 4264248.52 frames. ], batch size: 371, lr: 7.55e-03, grad_scale: 32.0
2023-06-21 04:35:06,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0
2023-06-21 04:35:16,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=714324.0, ans=0.125
2023-06-21 04:36:03,178 INFO [train.py:996] (0/4) Epoch 4, batch 27600, loss[loss=0.2192, simple_loss=0.2783, pruned_loss=0.08004, over 21321.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.311, pruned_loss=0.08543, over 4260562.21 frames. ], batch size: 160, lr: 7.54e-03, grad_scale: 32.0
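In each train.py:996 summary, loss[...] describes the current batch while tot_loss[...] aggregates over a much larger frame count (e.g. 4260562.21 frames in the batch 27600 entry above). A plausible reading, assumed here rather than taken from train.py, is a frame-weighted running average with decay so that old batches gradually fade out:

```python
class FrameWeightedAverage:
    """Running frame-weighted loss average -- a guess at how tot_loss-style
    numbers could be maintained; the decay constant is hypothetical."""

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> float:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames
        return self.loss_sum / self.frames  # the value reported as tot_loss

avg = FrameWeightedAverage()
print(avg.update(0.2192, 21321.0))  # after one batch, equals the batch loss
```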
2023-06-21 04:36:09,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=714504.0, ans=0.125
2023-06-21 04:36:56,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0
2023-06-21 04:37:27,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.399e+02 2.649e+02 3.187e+02 5.092e+02, threshold=5.298e+02, percent-clipped=0.0
2023-06-21 04:37:38,965 INFO [train.py:996] (0/4) Epoch 4, batch 27650, loss[loss=0.2176, simple_loss=0.3013, pruned_loss=0.06699, over 21453.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3042, pruned_loss=0.08426, over 4259129.90 frames. ], batch size: 194, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:37:48,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.09 vs. limit=6.0
2023-06-21 04:37:49,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=714804.0, ans=0.1
2023-06-21 04:38:30,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=714924.0, ans=0.125
2023-06-21 04:39:20,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=715044.0, ans=0.125
2023-06-21 04:39:22,135 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0
2023-06-21 04:39:27,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=715044.0, ans=0.0
2023-06-21 04:39:31,717 INFO [train.py:996] (0/4) Epoch 4, batch 27700, loss[loss=0.2263, simple_loss=0.3039, pruned_loss=0.07441, over 21384.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3027, pruned_loss=0.08217, over 4257583.92 frames. ], batch size: 211, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:39:35,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=715104.0, ans=0.125
2023-06-21 04:39:54,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=715164.0, ans=0.125
2023-06-21 04:40:12,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=22.5
2023-06-21 04:40:54,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=715344.0, ans=0.125
2023-06-21 04:40:56,522 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.480e+02 2.917e+02 3.384e+02 5.795e+02, threshold=5.834e+02, percent-clipped=2.0
2023-06-21 04:41:07,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0
2023-06-21 04:41:07,996 INFO [train.py:996] (0/4) Epoch 4, batch 27750, loss[loss=0.223, simple_loss=0.3038, pruned_loss=0.07109, over 21802.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3064, pruned_loss=0.08193, over 4262606.82 frames. ], batch size: 351, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:41:11,528 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:42:15,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=715584.0, ans=0.125
2023-06-21 04:42:45,358 INFO [train.py:996] (0/4) Epoch 4, batch 27800, loss[loss=0.2603, simple_loss=0.3302, pruned_loss=0.09519, over 21869.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3061, pruned_loss=0.08204, over 4275491.30 frames. ], batch size: 107, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:44:21,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.548e+02 3.003e+02 3.795e+02 7.044e+02, threshold=6.005e+02, percent-clipped=2.0
2023-06-21 04:44:33,697 INFO [train.py:996] (0/4) Epoch 4, batch 27850, loss[loss=0.2491, simple_loss=0.3018, pruned_loss=0.09817, over 21577.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3052, pruned_loss=0.08318, over 4283444.72 frames. ], batch size: 548, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:44:49,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0
2023-06-21 04:44:50,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=716004.0, ans=0.5
2023-06-21 04:45:25,723 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:46:06,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=716244.0, ans=0.125
2023-06-21 04:46:17,451 INFO [train.py:996] (0/4) Epoch 4, batch 27900, loss[loss=0.2406, simple_loss=0.3236, pruned_loss=0.07876, over 21464.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3165, pruned_loss=0.08538, over 4275748.86 frames. ], batch size: 194, lr: 7.54e-03, grad_scale: 32.0
2023-06-21 04:46:33,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=716304.0, ans=0.125
2023-06-21 04:47:06,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=716424.0, ans=0.125
2023-06-21 04:47:11,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=716424.0, ans=0.2
2023-06-21 04:47:14,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=716424.0, ans=0.2
2023-06-21 04:47:51,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.47 vs. limit=12.0
2023-06-21 04:48:01,882 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.532e+02 3.006e+02 3.855e+02 8.106e+02, threshold=6.012e+02, percent-clipped=3.0
2023-06-21 04:48:05,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=716544.0, ans=0.015
2023-06-21 04:48:19,855 INFO [train.py:996] (0/4) Epoch 4, batch 27950, loss[loss=0.2884, simple_loss=0.362, pruned_loss=0.1074, over 21722.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3162, pruned_loss=0.08205, over 4272297.19 frames. ], batch size: 441, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 04:48:21,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5
2023-06-21 04:48:45,788 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0
2023-06-21 04:48:47,471 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0
2023-06-21 04:48:55,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=716664.0, ans=0.0
2023-06-21 04:49:05,967 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:49:42,820 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0
2023-06-21 04:50:11,986 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0
2023-06-21 04:50:18,743 INFO [train.py:996] (0/4) Epoch 4, batch 28000, loss[loss=0.2521, simple_loss=0.3141, pruned_loss=0.09508, over 21858.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3128, pruned_loss=0.07956, over 4276342.23 frames. ], batch size: 414, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 04:51:30,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=717084.0, ans=0.125
2023-06-21 04:51:43,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=717084.0, ans=0.125
2023-06-21 04:51:59,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.391e+02 2.762e+02 3.354e+02 5.546e+02, threshold=5.523e+02, percent-clipped=0.0
2023-06-21 04:52:22,060 INFO [train.py:996] (0/4) Epoch 4, batch 28050, loss[loss=0.2931, simple_loss=0.3595, pruned_loss=0.1134, over 21595.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3112, pruned_loss=0.08152, over 4278077.93 frames. ], batch size: 508, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 04:52:30,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.96 vs. limit=10.0
2023-06-21 04:52:34,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=717204.0, ans=0.2
2023-06-21 04:52:50,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=717264.0, ans=0.04949747468305833
2023-06-21 04:52:55,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=717324.0, ans=0.0
2023-06-21 04:53:15,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=717324.0, ans=0.0
2023-06-21 04:54:16,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=717444.0, ans=0.1
2023-06-21 04:54:21,887 INFO [train.py:996] (0/4) Epoch 4, batch 28100, loss[loss=0.2162, simple_loss=0.2938, pruned_loss=0.06935, over 20721.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3103, pruned_loss=0.0808, over 4274930.81 frames. ], batch size: 608, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 04:54:40,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=717504.0, ans=0.125
2023-06-21 04:54:55,846 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 04:55:09,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=717624.0, ans=0.05
2023-06-21 04:55:46,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=717684.0, ans=0.0
2023-06-21 04:56:20,285 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.718e+02 3.349e+02 4.404e+02 9.791e+02, threshold=6.698e+02, percent-clipped=12.0
2023-06-21 04:56:36,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=717744.0, ans=0.1
2023-06-21 04:56:39,059 INFO [train.py:996] (0/4) Epoch 4, batch 28150, loss[loss=0.2187, simple_loss=0.2771, pruned_loss=0.08011, over 21606.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.304, pruned_loss=0.08108, over 4273826.55 frames. ], batch size: 332, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 04:56:50,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=717804.0, ans=0.2
2023-06-21 04:57:06,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=717804.0, ans=0.1
2023-06-21 04:57:29,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=717864.0, ans=0.125
2023-06-21 04:58:01,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=717924.0, ans=0.0
2023-06-21 04:58:30,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=717984.0, ans=0.1
2023-06-21 04:59:19,389 INFO [train.py:996] (0/4) Epoch 4, batch 28200, loss[loss=0.2369, simple_loss=0.3058, pruned_loss=0.08401, over 21688.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3001, pruned_loss=0.08238, over 4278603.05 frames. ], batch size: 112, lr: 7.53e-03, grad_scale: 32.0
2023-06-21 05:00:51,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0
2023-06-21 05:00:54,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=718284.0, ans=0.0
2023-06-21 05:01:01,121 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.643e+02 3.167e+02 3.877e+02 7.013e+02, threshold=6.334e+02, percent-clipped=1.0
2023-06-21 05:01:27,691 INFO [train.py:996] (0/4) Epoch 4, batch 28250, loss[loss=0.2149, simple_loss=0.28, pruned_loss=0.07489, over 21690.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3047, pruned_loss=0.08539, over 4270465.88 frames.
], batch size: 282, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:01:30,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=718404.0, ans=0.0 2023-06-21 05:01:31,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-21 05:01:51,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=718404.0, ans=0.1 2023-06-21 05:01:58,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=718464.0, ans=0.1 2023-06-21 05:02:18,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-21 05:02:21,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=718464.0, ans=0.035 2023-06-21 05:02:55,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=718524.0, ans=0.125 2023-06-21 05:02:56,528 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.40 vs. limit=6.0 2023-06-21 05:03:18,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=718584.0, ans=0.2 2023-06-21 05:03:21,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=718584.0, ans=0.125 2023-06-21 05:04:18,250 INFO [train.py:996] (0/4) Epoch 4, batch 28300, loss[loss=0.2026, simple_loss=0.2883, pruned_loss=0.0584, over 21772.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3033, pruned_loss=0.08352, over 4255159.19 frames. ], batch size: 282, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:04:29,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=718704.0, ans=0.0 2023-06-21 05:05:55,742 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:06:09,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 2.185e+02 2.470e+02 3.133e+02 5.042e+02, threshold=4.941e+02, percent-clipped=0.0 2023-06-21 05:06:39,497 INFO [train.py:996] (0/4) Epoch 4, batch 28350, loss[loss=0.2214, simple_loss=0.2869, pruned_loss=0.07794, over 21652.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2995, pruned_loss=0.07767, over 4260667.99 frames. ], batch size: 332, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:06:59,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=719004.0, ans=0.0 2023-06-21 05:07:06,565 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-21 05:07:07,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=719004.0, ans=0.04949747468305833 2023-06-21 05:08:03,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.34 vs. 
limit=15.0 2023-06-21 05:08:18,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0 2023-06-21 05:08:50,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=719244.0, ans=0.125 2023-06-21 05:09:31,704 INFO [train.py:996] (0/4) Epoch 4, batch 28400, loss[loss=0.2198, simple_loss=0.2933, pruned_loss=0.07315, over 16284.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2959, pruned_loss=0.07667, over 4250967.65 frames. ], batch size: 61, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:09:59,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=719364.0, ans=0.1 2023-06-21 05:10:46,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=719424.0, ans=0.0 2023-06-21 05:11:42,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.715e+02 3.259e+02 3.998e+02 7.508e+02, threshold=6.518e+02, percent-clipped=8.0 2023-06-21 05:12:04,397 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:12:07,596 INFO [train.py:996] (0/4) Epoch 4, batch 28450, loss[loss=0.2492, simple_loss=0.3196, pruned_loss=0.0894, over 21692.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.301, pruned_loss=0.0804, over 4258816.31 frames. ], batch size: 298, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:12:58,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=719664.0, ans=0.1 2023-06-21 05:13:26,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=719784.0, ans=0.0 2023-06-21 05:14:29,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=719844.0, ans=0.2 2023-06-21 05:14:54,377 INFO [train.py:996] (0/4) Epoch 4, batch 28500, loss[loss=0.2529, simple_loss=0.332, pruned_loss=0.08687, over 21413.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3039, pruned_loss=0.083, over 4267665.82 frames. 
], batch size: 131, lr: 7.52e-03, grad_scale: 32.0 2023-06-21 05:15:02,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=719904.0, ans=0.125 2023-06-21 05:15:19,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=719964.0, ans=0.07 2023-06-21 05:15:34,972 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-120000.pt 2023-06-21 05:16:09,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=720024.0, ans=0.125 2023-06-21 05:16:47,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=720084.0, ans=0.1 2023-06-21 05:16:56,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=720144.0, ans=0.125 2023-06-21 05:17:05,831 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.689e+02 3.008e+02 3.521e+02 7.488e+02, threshold=6.015e+02, percent-clipped=1.0 2023-06-21 05:17:37,837 INFO [train.py:996] (0/4) Epoch 4, batch 28550, loss[loss=0.2683, simple_loss=0.3562, pruned_loss=0.09019, over 21274.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3133, pruned_loss=0.08623, over 4270575.83 frames. ], batch size: 548, lr: 7.51e-03, grad_scale: 32.0 2023-06-21 05:17:44,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=720204.0, ans=0.125 2023-06-21 05:18:12,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=720264.0, ans=0.125 2023-06-21 05:18:53,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=720324.0, ans=0.125 2023-06-21 05:18:53,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=720324.0, ans=0.2 2023-06-21 05:19:16,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=720384.0, ans=0.95 2023-06-21 05:20:28,533 INFO [train.py:996] (0/4) Epoch 4, batch 28600, loss[loss=0.2589, simple_loss=0.3089, pruned_loss=0.1044, over 20033.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3213, pruned_loss=0.08954, over 4272128.42 frames. 
], batch size: 703, lr: 7.51e-03, grad_scale: 16.0 2023-06-21 05:20:40,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=720504.0, ans=0.125 2023-06-21 05:20:44,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=720564.0, ans=0.95 2023-06-21 05:21:37,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=720624.0, ans=0.0 2023-06-21 05:21:37,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=720624.0, ans=0.125 2023-06-21 05:21:40,900 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:21:47,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=720684.0, ans=0.0 2023-06-21 05:21:52,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=22.5 2023-06-21 05:22:06,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=720684.0, ans=0.125 2023-06-21 05:22:34,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=720744.0, ans=0.0 2023-06-21 05:22:37,118 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.703e+02 3.009e+02 3.352e+02 5.683e+02, threshold=6.017e+02, percent-clipped=0.0 2023-06-21 05:22:44,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=720744.0, ans=0.2 2023-06-21 05:22:51,419 INFO [train.py:996] (0/4) Epoch 4, batch 28650, loss[loss=0.2073, simple_loss=0.2669, pruned_loss=0.07388, over 21641.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3157, pruned_loss=0.08865, over 4279074.24 frames. ], batch size: 282, lr: 7.51e-03, grad_scale: 16.0 2023-06-21 05:25:30,149 INFO [train.py:996] (0/4) Epoch 4, batch 28700, loss[loss=0.2221, simple_loss=0.2924, pruned_loss=0.07588, over 21145.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3149, pruned_loss=0.08959, over 4271602.21 frames. ], batch size: 143, lr: 7.51e-03, grad_scale: 16.0 2023-06-21 05:25:37,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=721104.0, ans=0.2 2023-06-21 05:25:37,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=721104.0, ans=0.125 2023-06-21 05:26:02,478 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.97 vs. 
limit=22.5 2023-06-21 05:26:07,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=721164.0, ans=0.125 2023-06-21 05:26:20,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=721164.0, ans=0.125 2023-06-21 05:27:43,280 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:27:52,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.736e+02 3.172e+02 3.821e+02 7.853e+02, threshold=6.343e+02, percent-clipped=6.0 2023-06-21 05:28:04,625 INFO [train.py:996] (0/4) Epoch 4, batch 28750, loss[loss=0.2716, simple_loss=0.3436, pruned_loss=0.09973, over 21819.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3131, pruned_loss=0.08964, over 4274918.10 frames. ], batch size: 112, lr: 7.51e-03, grad_scale: 16.0 2023-06-21 05:28:21,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=721404.0, ans=0.0 2023-06-21 05:28:22,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=721404.0, ans=0.1 2023-06-21 05:30:54,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-21 05:30:56,309 INFO [train.py:996] (0/4) Epoch 4, batch 28800, loss[loss=0.2554, simple_loss=0.3302, pruned_loss=0.09037, over 21758.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3174, pruned_loss=0.08955, over 4263612.82 frames. ], batch size: 332, lr: 7.51e-03, grad_scale: 32.0 2023-06-21 05:30:58,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=721704.0, ans=0.1 2023-06-21 05:31:37,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.68 vs. limit=22.5 2023-06-21 05:31:51,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=721764.0, ans=0.125 2023-06-21 05:31:54,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=721764.0, ans=0.0 2023-06-21 05:33:14,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-21 05:33:18,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.546e+02 2.847e+02 3.234e+02 4.509e+02, threshold=5.694e+02, percent-clipped=0.0 2023-06-21 05:33:30,194 INFO [train.py:996] (0/4) Epoch 4, batch 28850, loss[loss=0.2634, simple_loss=0.322, pruned_loss=0.1024, over 21882.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3197, pruned_loss=0.09137, over 4269821.97 frames. ], batch size: 371, lr: 7.51e-03, grad_scale: 32.0 2023-06-21 05:34:04,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=722004.0, ans=0.125 2023-06-21 05:34:32,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. 
limit=6.0 2023-06-21 05:36:42,440 INFO [train.py:996] (0/4) Epoch 4, batch 28900, loss[loss=0.2603, simple_loss=0.3318, pruned_loss=0.09437, over 21357.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3225, pruned_loss=0.09336, over 4279576.25 frames. ], batch size: 548, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:38:48,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=722544.0, ans=0.0 2023-06-21 05:39:05,967 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.757e+02 3.326e+02 3.999e+02 7.317e+02, threshold=6.653e+02, percent-clipped=7.0 2023-06-21 05:39:44,237 INFO [train.py:996] (0/4) Epoch 4, batch 28950, loss[loss=0.2111, simple_loss=0.2944, pruned_loss=0.06391, over 21695.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3245, pruned_loss=0.09274, over 4283088.97 frames. ], batch size: 247, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:40:04,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=722604.0, ans=0.0 2023-06-21 05:40:19,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=722664.0, ans=0.125 2023-06-21 05:41:51,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0 2023-06-21 05:42:35,361 INFO [train.py:996] (0/4) Epoch 4, batch 29000, loss[loss=0.278, simple_loss=0.3365, pruned_loss=0.1097, over 19982.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3287, pruned_loss=0.0924, over 4279784.76 frames. ], batch size: 702, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:43:51,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=723024.0, ans=0.2 2023-06-21 05:43:53,528 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-21 05:43:57,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=723024.0, ans=0.125 2023-06-21 05:44:54,074 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.818e+02 3.202e+02 3.885e+02 5.481e+02, threshold=6.403e+02, percent-clipped=0.0 2023-06-21 05:45:34,348 INFO [train.py:996] (0/4) Epoch 4, batch 29050, loss[loss=0.288, simple_loss=0.3505, pruned_loss=0.1128, over 21787.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3258, pruned_loss=0.09274, over 4283421.48 frames. 
], batch size: 112, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:45:58,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=723264.0, ans=0.0 2023-06-21 05:46:24,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=723324.0, ans=0.125 2023-06-21 05:46:34,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=723324.0, ans=0.035 2023-06-21 05:46:50,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=723384.0, ans=0.125 2023-06-21 05:46:54,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=723384.0, ans=0.125 2023-06-21 05:47:33,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=723444.0, ans=0.09899494936611666 2023-06-21 05:47:50,945 INFO [train.py:996] (0/4) Epoch 4, batch 29100, loss[loss=0.195, simple_loss=0.2535, pruned_loss=0.06829, over 21261.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3163, pruned_loss=0.08944, over 4287042.81 frames. ], batch size: 160, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:49:06,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=723624.0, ans=0.2 2023-06-21 05:50:00,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=723744.0, ans=0.0 2023-06-21 05:50:04,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.24 vs. limit=22.5 2023-06-21 05:50:05,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=723744.0, ans=0.0 2023-06-21 05:50:06,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.494e+02 2.914e+02 3.696e+02 5.946e+02, threshold=5.828e+02, percent-clipped=0.0 2023-06-21 05:50:19,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=723744.0, ans=0.125 2023-06-21 05:50:36,697 INFO [train.py:996] (0/4) Epoch 4, batch 29150, loss[loss=0.2606, simple_loss=0.353, pruned_loss=0.08408, over 21844.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3139, pruned_loss=0.08769, over 4279744.07 frames. ], batch size: 371, lr: 7.50e-03, grad_scale: 32.0 2023-06-21 05:50:46,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=723804.0, ans=0.02 2023-06-21 05:51:04,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=723864.0, ans=0.125 2023-06-21 05:51:05,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=723864.0, ans=0.0 2023-06-21 05:53:06,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=724044.0, ans=0.0 2023-06-21 05:53:18,585 INFO [train.py:996] (0/4) Epoch 4, batch 29200, loss[loss=0.2114, simple_loss=0.2662, pruned_loss=0.07826, over 20750.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3093, pruned_loss=0.08672, over 4271318.54 frames. 
], batch size: 607, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 05:54:14,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=724224.0, ans=0.0 2023-06-21 05:54:38,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=724224.0, ans=15.0 2023-06-21 05:55:44,457 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.500e+02 2.791e+02 3.395e+02 6.739e+02, threshold=5.582e+02, percent-clipped=1.0 2023-06-21 05:55:53,792 INFO [train.py:996] (0/4) Epoch 4, batch 29250, loss[loss=0.2352, simple_loss=0.3246, pruned_loss=0.07296, over 21858.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3071, pruned_loss=0.08387, over 4266739.35 frames. ], batch size: 373, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 05:55:57,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=724404.0, ans=0.125 2023-06-21 05:56:03,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=724404.0, ans=0.0 2023-06-21 05:57:00,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=724524.0, ans=0.0 2023-06-21 05:57:02,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=724524.0, ans=0.125 2023-06-21 05:57:29,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=724584.0, ans=0.0 2023-06-21 05:57:48,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=724644.0, ans=0.0 2023-06-21 05:58:05,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=724644.0, ans=0.1 2023-06-21 05:58:06,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-21 05:58:12,174 INFO [train.py:996] (0/4) Epoch 4, batch 29300, loss[loss=0.2011, simple_loss=0.2696, pruned_loss=0.06626, over 21408.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3091, pruned_loss=0.08318, over 4272885.74 frames. ], batch size: 131, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 05:59:39,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=12.0 2023-06-21 05:59:47,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=724884.0, ans=0.0 2023-06-21 06:00:47,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.479e+02 2.818e+02 3.454e+02 4.910e+02, threshold=5.636e+02, percent-clipped=0.0 2023-06-21 06:00:47,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=724944.0, ans=0.125 2023-06-21 06:00:51,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=724944.0, ans=0.125 2023-06-21 06:00:58,265 INFO [train.py:996] (0/4) Epoch 4, batch 29350, loss[loss=0.2461, simple_loss=0.3328, pruned_loss=0.07971, over 21856.00 frames. 
], tot_loss[loss=0.2357, simple_loss=0.3059, pruned_loss=0.08276, over 4277059.21 frames. ], batch size: 373, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 06:01:26,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=725004.0, ans=0.125 2023-06-21 06:02:59,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=725184.0, ans=0.125 2023-06-21 06:03:35,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-21 06:03:35,559 INFO [train.py:996] (0/4) Epoch 4, batch 29400, loss[loss=0.2625, simple_loss=0.3383, pruned_loss=0.09337, over 21454.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3052, pruned_loss=0.08087, over 4272231.66 frames. ], batch size: 471, lr: 7.49e-03, grad_scale: 32.0 2023-06-21 06:04:20,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=725364.0, ans=0.0 2023-06-21 06:05:20,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=725484.0, ans=0.125 2023-06-21 06:05:20,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=725484.0, ans=0.0 2023-06-21 06:05:23,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=725484.0, ans=0.125 2023-06-21 06:05:42,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.43 vs. limit=22.5 2023-06-21 06:06:06,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.554e+02 2.866e+02 3.298e+02 4.992e+02, threshold=5.731e+02, percent-clipped=0.0 2023-06-21 06:06:14,138 INFO [train.py:996] (0/4) Epoch 4, batch 29450, loss[loss=0.2469, simple_loss=0.3271, pruned_loss=0.0834, over 21329.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3029, pruned_loss=0.07949, over 4275053.33 frames. ], batch size: 549, lr: 7.49e-03, grad_scale: 16.0 2023-06-21 06:06:28,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=725604.0, ans=0.0 2023-06-21 06:07:00,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=725664.0, ans=0.0 2023-06-21 06:07:55,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=725724.0, ans=0.0 2023-06-21 06:08:01,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-21 06:08:07,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=725784.0, ans=0.1 2023-06-21 06:08:59,114 INFO [train.py:996] (0/4) Epoch 4, batch 29500, loss[loss=0.2347, simple_loss=0.2999, pruned_loss=0.08474, over 21809.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.309, pruned_loss=0.08388, over 4279773.42 frames. 
], batch size: 124, lr: 7.49e-03, grad_scale: 16.0 2023-06-21 06:09:52,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=725964.0, ans=0.125 2023-06-21 06:11:24,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.15 vs. limit=15.0 2023-06-21 06:11:27,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.546e+02 2.828e+02 3.561e+02 7.164e+02, threshold=5.655e+02, percent-clipped=2.0 2023-06-21 06:11:36,389 INFO [train.py:996] (0/4) Epoch 4, batch 29550, loss[loss=0.2584, simple_loss=0.3155, pruned_loss=0.1006, over 21887.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3087, pruned_loss=0.08613, over 4292023.23 frames. ], batch size: 414, lr: 7.48e-03, grad_scale: 16.0 2023-06-21 06:11:54,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=726204.0, ans=0.125 2023-06-21 06:12:22,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.21 vs. limit=22.5 2023-06-21 06:13:58,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=726384.0, ans=0.125 2023-06-21 06:14:36,518 INFO [train.py:996] (0/4) Epoch 4, batch 29600, loss[loss=0.2701, simple_loss=0.36, pruned_loss=0.09012, over 21724.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3145, pruned_loss=0.0882, over 4290805.52 frames. ], batch size: 351, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:15:40,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=726564.0, ans=0.125 2023-06-21 06:16:30,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-21 06:16:45,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=726684.0, ans=0.125 2023-06-21 06:17:16,958 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:17:17,835 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 2.703e+02 2.966e+02 3.927e+02 7.887e+02, threshold=5.932e+02, percent-clipped=5.0 2023-06-21 06:17:20,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-21 06:17:21,852 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2023-06-21 06:17:25,417 INFO [train.py:996] (0/4) Epoch 4, batch 29650, loss[loss=0.2678, simple_loss=0.3793, pruned_loss=0.07815, over 19841.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3144, pruned_loss=0.08493, over 4288204.21 frames. ], batch size: 702, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:17:36,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=726804.0, ans=0.125 2023-06-21 06:18:11,590 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.85 vs. 
limit=22.5 2023-06-21 06:18:52,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=726924.0, ans=0.125 2023-06-21 06:20:04,034 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-06-21 06:20:05,846 INFO [train.py:996] (0/4) Epoch 4, batch 29700, loss[loss=0.2413, simple_loss=0.3292, pruned_loss=0.07666, over 19912.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3138, pruned_loss=0.08438, over 4287599.98 frames. ], batch size: 702, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:20:14,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.12 vs. limit=10.0 2023-06-21 06:20:58,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=727164.0, ans=0.0 2023-06-21 06:21:15,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=22.5 2023-06-21 06:21:54,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=727284.0, ans=0.2 2023-06-21 06:22:16,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=727344.0, ans=0.125 2023-06-21 06:22:31,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=727344.0, ans=0.0 2023-06-21 06:22:32,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 2.514e+02 2.951e+02 3.470e+02 6.624e+02, threshold=5.902e+02, percent-clipped=3.0 2023-06-21 06:22:45,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=727344.0, ans=0.125 2023-06-21 06:22:49,525 INFO [train.py:996] (0/4) Epoch 4, batch 29750, loss[loss=0.2323, simple_loss=0.3195, pruned_loss=0.07251, over 21613.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3188, pruned_loss=0.08368, over 4291604.99 frames. ], batch size: 230, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:23:09,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=727404.0, ans=0.0 2023-06-21 06:24:24,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=727584.0, ans=0.125 2023-06-21 06:24:25,400 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.98 vs. limit=22.5 2023-06-21 06:25:26,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=727644.0, ans=0.125 2023-06-21 06:25:34,518 INFO [train.py:996] (0/4) Epoch 4, batch 29800, loss[loss=0.2609, simple_loss=0.3199, pruned_loss=0.1009, over 21787.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3199, pruned_loss=0.08452, over 4293212.91 frames. 
], batch size: 441, lr: 7.48e-03, grad_scale: 32.0 2023-06-21 06:26:12,794 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:26:50,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=727824.0, ans=0.015 2023-06-21 06:27:11,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=727824.0, ans=0.1 2023-06-21 06:27:26,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=727884.0, ans=0.125 2023-06-21 06:27:36,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=727884.0, ans=0.0 2023-06-21 06:27:40,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=727884.0, ans=0.125 2023-06-21 06:28:12,227 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.465e+02 2.740e+02 3.189e+02 4.611e+02, threshold=5.479e+02, percent-clipped=0.0 2023-06-21 06:28:19,325 INFO [train.py:996] (0/4) Epoch 4, batch 29850, loss[loss=0.1955, simple_loss=0.2778, pruned_loss=0.05663, over 21622.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3161, pruned_loss=0.08293, over 4288916.55 frames. ], batch size: 263, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:28:21,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=728004.0, ans=0.125 2023-06-21 06:28:30,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.89 vs. limit=5.0 2023-06-21 06:29:47,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=728124.0, ans=0.2 2023-06-21 06:30:52,318 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2023-06-21 06:31:03,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.13 vs. limit=22.5 2023-06-21 06:31:05,325 INFO [train.py:996] (0/4) Epoch 4, batch 29900, loss[loss=0.2972, simple_loss=0.356, pruned_loss=0.1192, over 21582.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3136, pruned_loss=0.0844, over 4292200.65 frames. ], batch size: 389, lr: 7.47e-03, grad_scale: 16.0 2023-06-21 06:31:17,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=728304.0, ans=0.125 2023-06-21 06:31:20,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=728304.0, ans=0.125 2023-06-21 06:33:31,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.587e+02 2.985e+02 3.529e+02 5.856e+02, threshold=5.969e+02, percent-clipped=3.0 2023-06-21 06:33:48,925 INFO [train.py:996] (0/4) Epoch 4, batch 29950, loss[loss=0.2158, simple_loss=0.2628, pruned_loss=0.08441, over 20164.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3177, pruned_loss=0.08795, over 4289582.31 frames. 
], batch size: 704, lr: 7.47e-03, grad_scale: 16.0 2023-06-21 06:34:03,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=728604.0, ans=0.125 2023-06-21 06:35:38,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=728784.0, ans=0.125 2023-06-21 06:35:39,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=728784.0, ans=0.125 2023-06-21 06:36:14,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=728844.0, ans=0.0 2023-06-21 06:36:36,777 INFO [train.py:996] (0/4) Epoch 4, batch 30000, loss[loss=0.231, simple_loss=0.3181, pruned_loss=0.07198, over 21665.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3203, pruned_loss=0.08866, over 4289747.79 frames. ], batch size: 263, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:36:36,778 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 06:37:40,808 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2514, simple_loss=0.3484, pruned_loss=0.07722, over 1796401.00 frames. 2023-06-21 06:37:40,809 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB 2023-06-21 06:38:03,609 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5 2023-06-21 06:38:53,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=729024.0, ans=0.0 2023-06-21 06:39:07,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=729084.0, ans=0.125 2023-06-21 06:40:00,393 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.416e+02 2.798e+02 3.507e+02 5.014e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-21 06:40:26,624 INFO [train.py:996] (0/4) Epoch 4, batch 30050, loss[loss=0.2768, simple_loss=0.3644, pruned_loss=0.09457, over 21700.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3227, pruned_loss=0.08579, over 4283634.55 frames. ], batch size: 298, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:41:31,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=729324.0, ans=0.2 2023-06-21 06:41:41,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=729384.0, ans=0.0 2023-06-21 06:42:42,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=729504.0, ans=0.125 2023-06-21 06:42:43,068 INFO [train.py:996] (0/4) Epoch 4, batch 30100, loss[loss=0.216, simple_loss=0.2773, pruned_loss=0.0773, over 21769.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3209, pruned_loss=0.08524, over 4281118.54 frames. ], batch size: 112, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:43:12,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=729504.0, ans=0.125 2023-06-21 06:43:21,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. 
limit=22.5 2023-06-21 06:43:24,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=729564.0, ans=0.0 2023-06-21 06:44:38,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=729684.0, ans=0.0 2023-06-21 06:45:00,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.306e+02 2.761e+02 3.179e+02 3.830e+02 7.077e+02, threshold=6.357e+02, percent-clipped=3.0 2023-06-21 06:45:32,469 INFO [train.py:996] (0/4) Epoch 4, batch 30150, loss[loss=0.2944, simple_loss=0.3487, pruned_loss=0.1201, over 21855.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3176, pruned_loss=0.08666, over 4275731.68 frames. ], batch size: 441, lr: 7.47e-03, grad_scale: 32.0 2023-06-21 06:45:32,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=729804.0, ans=0.2 2023-06-21 06:45:59,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.51 vs. limit=10.0 2023-06-21 06:46:09,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=729864.0, ans=0.125 2023-06-21 06:47:12,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.30 vs. limit=6.0 2023-06-21 06:47:16,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=729984.0, ans=0.125 2023-06-21 06:47:58,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=730044.0, ans=0.1 2023-06-21 06:47:59,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=730044.0, ans=0.2 2023-06-21 06:48:29,915 INFO [train.py:996] (0/4) Epoch 4, batch 30200, loss[loss=0.2611, simple_loss=0.3362, pruned_loss=0.09306, over 19958.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3214, pruned_loss=0.08631, over 4266229.46 frames. ], batch size: 702, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:48:33,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=730104.0, ans=0.125 2023-06-21 06:50:33,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=730284.0, ans=0.1 2023-06-21 06:51:08,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.530e+02 2.908e+02 3.490e+02 5.232e+02, threshold=5.817e+02, percent-clipped=0.0 2023-06-21 06:51:14,952 INFO [train.py:996] (0/4) Epoch 4, batch 30250, loss[loss=0.2714, simple_loss=0.3587, pruned_loss=0.09209, over 21255.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3262, pruned_loss=0.0879, over 4269268.24 frames. 
], batch size: 159, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:51:23,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=730404.0, ans=0.125 2023-06-21 06:51:58,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=730464.0, ans=0.125 2023-06-21 06:52:07,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=730464.0, ans=0.125 2023-06-21 06:52:26,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=730524.0, ans=0.0 2023-06-21 06:53:48,473 INFO [train.py:996] (0/4) Epoch 4, batch 30300, loss[loss=0.2067, simple_loss=0.2666, pruned_loss=0.07342, over 21300.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3243, pruned_loss=0.08758, over 4269794.24 frames. ], batch size: 160, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:55:08,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=730824.0, ans=0.2 2023-06-21 06:55:37,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=730884.0, ans=0.125 2023-06-21 06:56:35,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.610e+02 3.012e+02 3.806e+02 5.738e+02, threshold=6.023e+02, percent-clipped=0.0 2023-06-21 06:56:42,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=730944.0, ans=0.0 2023-06-21 06:56:46,355 INFO [train.py:996] (0/4) Epoch 4, batch 30350, loss[loss=0.214, simple_loss=0.2777, pruned_loss=0.07514, over 21451.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3235, pruned_loss=0.08829, over 4266929.33 frames. ], batch size: 194, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 06:58:15,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-21 06:58:31,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=731184.0, ans=0.0 2023-06-21 06:58:52,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=731184.0, ans=0.125 2023-06-21 06:59:19,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=10.0 2023-06-21 06:59:20,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=731244.0, ans=0.125 2023-06-21 07:00:07,936 INFO [train.py:996] (0/4) Epoch 4, batch 30400, loss[loss=0.2289, simple_loss=0.2905, pruned_loss=0.0837, over 20165.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.317, pruned_loss=0.08624, over 4254726.29 frames. 
], batch size: 702, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 07:00:52,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=731304.0, ans=0.125 2023-06-21 07:01:38,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=731364.0, ans=0.125 2023-06-21 07:01:45,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=731364.0, ans=0.125 2023-06-21 07:02:13,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-06-21 07:05:08,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 3.449e+02 4.209e+02 5.452e+02 1.525e+03, threshold=8.417e+02, percent-clipped=19.0 2023-06-21 07:05:24,266 INFO [train.py:996] (0/4) Epoch 4, batch 30450, loss[loss=0.2912, simple_loss=0.3928, pruned_loss=0.09478, over 19904.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3179, pruned_loss=0.08661, over 4198047.87 frames. ], batch size: 702, lr: 7.46e-03, grad_scale: 32.0 2023-06-21 07:06:00,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=731604.0, ans=0.1 2023-06-21 07:06:51,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=731664.0, ans=0.0 2023-06-21 07:09:32,105 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-4.pt 2023-06-21 07:12:03,769 INFO [train.py:996] (0/4) Epoch 5, batch 0, loss[loss=0.2593, simple_loss=0.3148, pruned_loss=0.1019, over 21544.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3148, pruned_loss=0.1019, over 21544.00 frames. ], batch size: 391, lr: 6.61e-03, grad_scale: 32.0 2023-06-21 07:12:03,770 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 07:12:43,344 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2379, simple_loss=0.3479, pruned_loss=0.06395, over 1796401.00 frames. 2023-06-21 07:12:43,345 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB 2023-06-21 07:13:38,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-06-21 07:14:21,775 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:14:47,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=732114.0, ans=0.0 2023-06-21 07:15:06,033 INFO [train.py:996] (0/4) Epoch 5, batch 50, loss[loss=0.2341, simple_loss=0.3109, pruned_loss=0.07863, over 21412.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3146, pruned_loss=0.08343, over 955888.49 frames. 
], batch size: 211, lr: 6.60e-03, grad_scale: 32.0 2023-06-21 07:15:13,371 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 3.133e+02 4.909e+02 7.707e+02 2.246e+03, threshold=9.818e+02, percent-clipped=21.0 2023-06-21 07:15:44,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=732294.0, ans=0.0 2023-06-21 07:17:02,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=732414.0, ans=0.015 2023-06-21 07:17:05,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.20 vs. limit=12.0 2023-06-21 07:17:09,034 INFO [train.py:996] (0/4) Epoch 5, batch 100, loss[loss=0.2761, simple_loss=0.3749, pruned_loss=0.08861, over 21851.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3374, pruned_loss=0.0879, over 1687860.39 frames. ], batch size: 316, lr: 6.60e-03, grad_scale: 32.0 2023-06-21 07:17:47,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=732534.0, ans=0.05 2023-06-21 07:17:51,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-06-21 07:17:58,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=732534.0, ans=0.125 2023-06-21 07:18:35,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=732594.0, ans=0.125 2023-06-21 07:18:49,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=732654.0, ans=0.125 2023-06-21 07:18:50,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=732654.0, ans=0.1 2023-06-21 07:18:51,221 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-21 07:19:30,624 INFO [train.py:996] (0/4) Epoch 5, batch 150, loss[loss=0.2503, simple_loss=0.363, pruned_loss=0.06879, over 19799.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.337, pruned_loss=0.08679, over 2260810.68 frames. ], batch size: 702, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:19:31,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=732774.0, ans=0.2 2023-06-21 07:19:43,411 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.868e+02 2.471e+02 2.754e+02 3.178e+02 4.719e+02, threshold=5.509e+02, percent-clipped=0.0 2023-06-21 07:19:49,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=732774.0, ans=0.125 2023-06-21 07:20:09,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=732834.0, ans=0.125 2023-06-21 07:20:59,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=732894.0, ans=0.2 2023-06-21 07:22:07,533 INFO [train.py:996] (0/4) Epoch 5, batch 200, loss[loss=0.27, simple_loss=0.3535, pruned_loss=0.09326, over 21442.00 frames. 
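
The "batch size" field swings widely across this section (from around 60 to over 700 utterances) because batches are capped by total audio duration rather than by utterance count: buckets of short utterances pack many cuts into one batch, buckets of long ones only a few. A toy sketch of duration-capped batching under that assumption (the real sampler additionally buckets by length and shuffles):

```python
def batches_by_duration(utterances, max_duration_s: float):
    """utterances: iterable of (utt_id, duration_s) pairs, roughly sorted by
    duration, as a bucketing sampler would arrange them."""
    batch, batch_dur = [], 0.0
    for utt_id, dur in utterances:
        # Close the batch once adding this cut would exceed the duration cap.
        if batch and batch_dur + dur > max_duration_s:
            yield batch
            batch, batch_dur = [], 0.0
        batch.append(utt_id)
        batch_dur += dur
    if batch:
        yield batch
```
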
], tot_loss[loss=0.2527, simple_loss=0.3345, pruned_loss=0.08545, over 2695750.31 frames. ], batch size: 131, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:23:16,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=733194.0, ans=0.125 2023-06-21 07:23:22,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=733194.0, ans=0.125 2023-06-21 07:23:25,976 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-21 07:23:35,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-21 07:24:19,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=733314.0, ans=0.125 2023-06-21 07:24:31,200 INFO [train.py:996] (0/4) Epoch 5, batch 250, loss[loss=0.2608, simple_loss=0.3259, pruned_loss=0.09785, over 21825.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3305, pruned_loss=0.08697, over 3043844.58 frames. ], batch size: 282, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:24:31,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=733374.0, ans=0.05 2023-06-21 07:24:34,084 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.530e+02 2.881e+02 3.593e+02 5.629e+02, threshold=5.761e+02, percent-clipped=1.0 2023-06-21 07:24:43,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=733374.0, ans=0.0 2023-06-21 07:25:36,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-21 07:25:45,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=733494.0, ans=0.125 2023-06-21 07:26:23,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=733554.0, ans=0.2 2023-06-21 07:26:48,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=733614.0, ans=0.2 2023-06-21 07:26:59,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=733614.0, ans=0.1 2023-06-21 07:27:07,460 INFO [train.py:996] (0/4) Epoch 5, batch 300, loss[loss=0.2218, simple_loss=0.3044, pruned_loss=0.06955, over 21309.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3254, pruned_loss=0.08585, over 3319628.12 frames. ], batch size: 159, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:27:53,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=733734.0, ans=0.125 2023-06-21 07:29:39,575 INFO [train.py:996] (0/4) Epoch 5, batch 350, loss[loss=0.2077, simple_loss=0.2825, pruned_loss=0.06642, over 21723.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3184, pruned_loss=0.08443, over 3528680.67 frames. 
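
Each ScheduledFloat entry records the current value ("ans") of a regularization hyperparameter that is scheduled against the global batch count; by batch_count ≈ 733k the schedules here have flattened to their final constants (balancer probs at 0.125, skip rates at 0.0, dropout_p at 0.1, bypass scale_min at 0.2). A minimal sketch of such a piecewise-linear schedule; the breakpoints below are made up for illustration:

```python
import bisect

class PiecewiseLinear:
    """Value is interpolated linearly between (batch_count, value) knots and
    held constant outside them."""
    def __init__(self, *points):
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a skip rate that starts at 0.5 and is annealed to 0.0 by batch 20k:
conv_skip_rate = PiecewiseLinear((0.0, 0.5), (4000.0, 0.25), (20000.0, 0.0))
```
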
], batch size: 282, lr: 6.60e-03, grad_scale: 16.0 2023-06-21 07:29:51,532 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.569e+02 2.912e+02 3.547e+02 5.180e+02, threshold=5.824e+02, percent-clipped=0.0 2023-06-21 07:29:52,632 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0 2023-06-21 07:31:10,850 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:31:51,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=734214.0, ans=0.125 2023-06-21 07:32:06,476 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:32:07,380 INFO [train.py:996] (0/4) Epoch 5, batch 400, loss[loss=0.1916, simple_loss=0.294, pruned_loss=0.04465, over 21803.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3116, pruned_loss=0.08126, over 3702922.40 frames. ], batch size: 316, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:32:48,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=734334.0, ans=0.0 2023-06-21 07:34:39,340 INFO [train.py:996] (0/4) Epoch 5, batch 450, loss[loss=0.2039, simple_loss=0.261, pruned_loss=0.07344, over 21298.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3113, pruned_loss=0.08107, over 3831655.68 frames. ], batch size: 160, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:34:47,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.593e+02 3.148e+02 3.879e+02 6.028e+02, threshold=6.296e+02, percent-clipped=1.0 2023-06-21 07:35:29,632 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:36:11,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=734754.0, ans=15.0 2023-06-21 07:36:42,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=734814.0, ans=0.2 2023-06-21 07:37:14,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=734814.0, ans=0.1 2023-06-21 07:37:22,100 INFO [train.py:996] (0/4) Epoch 5, batch 500, loss[loss=0.2021, simple_loss=0.2584, pruned_loss=0.07286, over 20732.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3087, pruned_loss=0.07997, over 3933660.71 frames. ], batch size: 608, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:37:38,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=734874.0, ans=0.125 2023-06-21 07:37:54,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=734934.0, ans=0.0 2023-06-21 07:37:55,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=734994.0, ans=0.2 2023-06-21 07:39:34,427 INFO [train.py:996] (0/4) Epoch 5, batch 550, loss[loss=0.3138, simple_loss=0.3888, pruned_loss=0.1194, over 21576.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3092, pruned_loss=0.07938, over 4013554.35 frames. 
], batch size: 441, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:39:56,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.656e+02 3.238e+02 3.999e+02 7.986e+02, threshold=6.476e+02, percent-clipped=2.0 2023-06-21 07:40:34,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=735294.0, ans=0.125 2023-06-21 07:42:15,952 INFO [train.py:996] (0/4) Epoch 5, batch 600, loss[loss=0.2423, simple_loss=0.3319, pruned_loss=0.07631, over 21675.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3128, pruned_loss=0.08055, over 4081761.94 frames. ], batch size: 230, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:42:27,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=735474.0, ans=0.2 2023-06-21 07:42:42,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=735534.0, ans=0.0 2023-06-21 07:43:15,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=735594.0, ans=0.0 2023-06-21 07:43:23,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=735594.0, ans=0.04949747468305833 2023-06-21 07:44:10,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=735714.0, ans=0.0 2023-06-21 07:44:38,941 INFO [train.py:996] (0/4) Epoch 5, batch 650, loss[loss=0.2103, simple_loss=0.2953, pruned_loss=0.06264, over 19950.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3151, pruned_loss=0.07981, over 4125160.88 frames. ], batch size: 703, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:44:39,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=735774.0, ans=0.2 2023-06-21 07:44:43,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 2.568e+02 2.858e+02 3.474e+02 5.611e+02, threshold=5.715e+02, percent-clipped=0.0 2023-06-21 07:44:50,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=735774.0, ans=0.1 2023-06-21 07:45:02,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=735774.0, ans=0.2 2023-06-21 07:45:14,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=735834.0, ans=0.0 2023-06-21 07:45:57,937 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:46:02,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=735954.0, ans=0.1 2023-06-21 07:47:07,521 INFO [train.py:996] (0/4) Epoch 5, batch 700, loss[loss=0.2504, simple_loss=0.3309, pruned_loss=0.08501, over 21886.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.316, pruned_loss=0.08063, over 4167270.67 frames. 
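
The lr field decays on two time scales: it drops at the epoch boundary (7.46e-03 through epoch 4, 6.61e-03 at the start of epoch 5) and creeps down within the epoch (6.61e-03 → 6.60e-03 → 6.59e-03 over the first few hundred batches). That shape matches a scheduler whose value is a product of a batch-dependent and an epoch-dependent factor, such as icefall's Eden; the classic form is sketched below purely to show the shape, with illustrative constants — recipe variants differ, and this is not tuned to reproduce the exact values above:

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Product of two smoothly decaying factors, one per time scale.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.5
    return base_lr * batch_factor * epoch_factor
```
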
], batch size: 124, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:47:15,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=736074.0, ans=0.1 2023-06-21 07:47:15,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=736074.0, ans=0.2 2023-06-21 07:48:13,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=736194.0, ans=0.0 2023-06-21 07:48:40,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=736254.0, ans=0.0 2023-06-21 07:48:43,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=736254.0, ans=0.2 2023-06-21 07:48:55,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=736314.0, ans=0.09899494936611666 2023-06-21 07:49:04,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=736314.0, ans=0.1 2023-06-21 07:49:41,157 INFO [train.py:996] (0/4) Epoch 5, batch 750, loss[loss=0.2313, simple_loss=0.2971, pruned_loss=0.08273, over 21945.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3137, pruned_loss=0.08085, over 4193472.09 frames. ], batch size: 316, lr: 6.59e-03, grad_scale: 32.0 2023-06-21 07:49:41,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=736374.0, ans=0.2 2023-06-21 07:49:43,941 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.823e+02 3.263e+02 3.934e+02 5.736e+02, threshold=6.525e+02, percent-clipped=1.0 2023-06-21 07:50:04,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=736434.0, ans=0.125 2023-06-21 07:52:01,040 INFO [train.py:996] (0/4) Epoch 5, batch 800, loss[loss=0.2476, simple_loss=0.2949, pruned_loss=0.1001, over 21347.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3107, pruned_loss=0.0811, over 4203818.82 frames. ], batch size: 471, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:52:28,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-21 07:53:27,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=736854.0, ans=0.2 2023-06-21 07:54:25,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=736914.0, ans=0.1 2023-06-21 07:54:31,393 INFO [train.py:996] (0/4) Epoch 5, batch 850, loss[loss=0.1868, simple_loss=0.2465, pruned_loss=0.0636, over 17053.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3083, pruned_loss=0.08125, over 4212070.33 frames. 
], batch size: 60, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:54:34,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.467e+02 2.771e+02 3.268e+02 5.744e+02, threshold=5.542e+02, percent-clipped=0.0 2023-06-21 07:54:42,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=736974.0, ans=0.0 2023-06-21 07:55:03,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=737034.0, ans=0.04949747468305833 2023-06-21 07:55:34,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=737094.0, ans=0.125 2023-06-21 07:56:14,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=737154.0, ans=0.125 2023-06-21 07:56:48,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=737274.0, ans=0.09899494936611666 2023-06-21 07:56:54,549 INFO [train.py:996] (0/4) Epoch 5, batch 900, loss[loss=0.2291, simple_loss=0.2917, pruned_loss=0.08325, over 21695.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3076, pruned_loss=0.08076, over 4230852.25 frames. ], batch size: 230, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:57:32,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=737334.0, ans=0.125 2023-06-21 07:57:42,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=737334.0, ans=0.0 2023-06-21 07:57:59,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.74 vs. limit=15.0 2023-06-21 07:58:18,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=737394.0, ans=0.2 2023-06-21 07:58:25,974 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=12.0 2023-06-21 07:58:37,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=737454.0, ans=0.125 2023-06-21 07:59:01,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=737514.0, ans=0.125 2023-06-21 07:59:28,522 INFO [train.py:996] (0/4) Epoch 5, batch 950, loss[loss=0.2284, simple_loss=0.3175, pruned_loss=0.06966, over 21763.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3066, pruned_loss=0.08035, over 4243085.19 frames. 
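
The Whitening lines report, for a given submodule, a whitening metric against that module's limit. The metric measures how far the per-group covariance of the module's activations is from a multiple of the identity: it equals 1.0 for perfectly "white" features and grows as the eigenvalue spread widens, with a corrective gradient penalty applied once it exceeds the limit. A plausible reconstruction of such a metric, assuming it is the eigenvalue ratio E[λ²]/E[λ]² — not necessarily icefall's exact formula:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """x: (num_frames, num_channels) activations."""
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)  # (g, n, d)
    covar = torch.matmul(x.transpose(1, 2), x) / n                 # (g, d, d)
    d = covar.shape[-1]
    trace_c = covar.diagonal(dim1=-2, dim2=-1).sum(-1)   # sum of eigenvalues
    trace_c2 = (covar * covar).sum((-2, -1))             # sum of squared eigenvalues
    # E[lambda^2] / E[lambda]^2 >= 1, with equality iff covar ∝ identity.
    return (trace_c2 * d / trace_c ** 2).mean()
```
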
], batch size: 351, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 07:59:31,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.538e+02 2.883e+02 3.307e+02 5.189e+02, threshold=5.766e+02, percent-clipped=0.0 2023-06-21 08:00:03,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=737574.0, ans=0.0 2023-06-21 08:00:13,442 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:00:14,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=737634.0, ans=0.125 2023-06-21 08:00:26,523 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:02:02,910 INFO [train.py:996] (0/4) Epoch 5, batch 1000, loss[loss=0.2636, simple_loss=0.3375, pruned_loss=0.09484, over 21339.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3071, pruned_loss=0.08038, over 4258672.48 frames. ], batch size: 176, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 08:02:18,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-21 08:02:38,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=737934.0, ans=0.125 2023-06-21 08:04:25,212 INFO [train.py:996] (0/4) Epoch 5, batch 1050, loss[loss=0.2176, simple_loss=0.284, pruned_loss=0.07556, over 21667.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3063, pruned_loss=0.08075, over 4260585.96 frames. ], batch size: 263, lr: 6.58e-03, grad_scale: 32.0 2023-06-21 08:04:28,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.450e+02 2.796e+02 3.213e+02 4.581e+02, threshold=5.591e+02, percent-clipped=0.0 2023-06-21 08:04:44,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=738174.0, ans=0.0 2023-06-21 08:04:54,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=738234.0, ans=0.125 2023-06-21 08:05:21,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=738234.0, ans=0.1 2023-06-21 08:06:09,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=738354.0, ans=0.125 2023-06-21 08:06:28,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=738354.0, ans=0.1 2023-06-21 08:07:09,336 INFO [train.py:996] (0/4) Epoch 5, batch 1100, loss[loss=0.2076, simple_loss=0.2606, pruned_loss=0.07735, over 20334.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3058, pruned_loss=0.08011, over 4262645.69 frames. ], batch size: 703, lr: 6.58e-03, grad_scale: 16.0 2023-06-21 08:07:27,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=738474.0, ans=0.1 2023-06-21 08:09:06,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.34 vs. 
limit=15.0 2023-06-21 08:09:08,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=738714.0, ans=0.0 2023-06-21 08:09:28,099 INFO [train.py:996] (0/4) Epoch 5, batch 1150, loss[loss=0.3375, simple_loss=0.3746, pruned_loss=0.1502, over 21717.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.307, pruned_loss=0.08055, over 4269701.69 frames. ], batch size: 507, lr: 6.57e-03, grad_scale: 16.0 2023-06-21 08:09:34,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=738774.0, ans=0.1 2023-06-21 08:09:37,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.476e+02 2.814e+02 3.322e+02 5.569e+02, threshold=5.628e+02, percent-clipped=0.0 2023-06-21 08:10:00,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=22.5 2023-06-21 08:10:43,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-21 08:11:41,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-21 08:12:23,024 INFO [train.py:996] (0/4) Epoch 5, batch 1200, loss[loss=0.2032, simple_loss=0.2469, pruned_loss=0.07974, over 20807.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3075, pruned_loss=0.08039, over 4271685.69 frames. ], batch size: 608, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:13:17,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=739194.0, ans=0.0 2023-06-21 08:14:26,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=739314.0, ans=0.125 2023-06-21 08:14:39,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=739374.0, ans=0.125 2023-06-21 08:14:40,379 INFO [train.py:996] (0/4) Epoch 5, batch 1250, loss[loss=0.262, simple_loss=0.3312, pruned_loss=0.09637, over 21186.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3098, pruned_loss=0.081, over 4269779.83 frames. ], batch size: 143, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:14:48,804 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.637e+02 3.101e+02 3.888e+02 6.560e+02, threshold=6.202e+02, percent-clipped=3.0 2023-06-21 08:15:14,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=739374.0, ans=0.05 2023-06-21 08:16:06,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=739554.0, ans=0.125 2023-06-21 08:16:54,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=739614.0, ans=0.07 2023-06-21 08:17:07,372 INFO [train.py:996] (0/4) Epoch 5, batch 1300, loss[loss=0.2482, simple_loss=0.3224, pruned_loss=0.08699, over 21727.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3111, pruned_loss=0.08176, over 4278131.63 frames. 
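
Throughout this log the three loss fields obey a fixed linear relation: loss = 0.5 × simple_loss + pruned_loss, i.e. a pruned-transducer objective with a 0.5 weight on the simple (linear-joiner) term. The batch-1250 totals above check out (0.5 × 0.3098 + 0.081 = 0.2359), as do the epoch-4 batch-30300 totals earlier in this section (0.5 × 0.3243 + 0.08758 ≈ 0.2497). A one-line spot check:

```python
# Spot-check of the loss decomposition against the batch-1250 entry above.
simple_loss, pruned_loss = 0.3098, 0.0810
assert abs(0.5 * simple_loss + pruned_loss - 0.2359) < 5e-4
```
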
], batch size: 298, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:17:59,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=739734.0, ans=0.125 2023-06-21 08:18:13,055 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:18:23,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=739794.0, ans=0.1 2023-06-21 08:18:35,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=739854.0, ans=0.125 2023-06-21 08:18:36,476 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-21 08:18:55,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=739854.0, ans=0.125 2023-06-21 08:18:56,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=739854.0, ans=0.0 2023-06-21 08:19:01,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=739854.0, ans=0.05 2023-06-21 08:19:41,996 INFO [train.py:996] (0/4) Epoch 5, batch 1350, loss[loss=0.2515, simple_loss=0.3189, pruned_loss=0.09212, over 21850.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3125, pruned_loss=0.08243, over 4287323.34 frames. ], batch size: 107, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:19:51,966 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.635e+02 2.966e+02 3.709e+02 5.719e+02, threshold=5.932e+02, percent-clipped=0.0 2023-06-21 08:20:31,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=740034.0, ans=0.0 2023-06-21 08:20:36,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=740094.0, ans=0.125 2023-06-21 08:21:12,070 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:21:20,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=740154.0, ans=0.125 2023-06-21 08:21:28,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=740154.0, ans=0.125 2023-06-21 08:21:43,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=740214.0, ans=0.1 2023-06-21 08:22:06,201 INFO [train.py:996] (0/4) Epoch 5, batch 1400, loss[loss=0.217, simple_loss=0.2921, pruned_loss=0.07095, over 21691.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3115, pruned_loss=0.08327, over 4282990.16 frames. ], batch size: 282, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:22:57,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=740334.0, ans=0.025 2023-06-21 08:24:10,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.78 vs. 
limit=10.0 2023-06-21 08:24:29,378 INFO [train.py:996] (0/4) Epoch 5, batch 1450, loss[loss=0.2482, simple_loss=0.3024, pruned_loss=0.09701, over 21663.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.31, pruned_loss=0.08343, over 4289616.16 frames. ], batch size: 414, lr: 6.57e-03, grad_scale: 32.0 2023-06-21 08:24:40,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 2.440e+02 2.892e+02 3.416e+02 5.937e+02, threshold=5.784e+02, percent-clipped=1.0 2023-06-21 08:25:08,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=740634.0, ans=0.0 2023-06-21 08:25:09,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=12.0 2023-06-21 08:26:02,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=740754.0, ans=0.125 2023-06-21 08:26:15,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=740814.0, ans=0.0 2023-06-21 08:26:55,908 INFO [train.py:996] (0/4) Epoch 5, batch 1500, loss[loss=0.2144, simple_loss=0.3082, pruned_loss=0.06024, over 21621.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3117, pruned_loss=0.08473, over 4294938.18 frames. ], batch size: 263, lr: 6.57e-03, grad_scale: 16.0 2023-06-21 08:27:07,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=740874.0, ans=0.1 2023-06-21 08:27:22,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=740934.0, ans=0.125 2023-06-21 08:28:13,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-06-21 08:29:22,635 INFO [train.py:996] (0/4) Epoch 5, batch 1550, loss[loss=0.2247, simple_loss=0.3092, pruned_loss=0.07006, over 21765.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3101, pruned_loss=0.08369, over 4288917.44 frames. ], batch size: 298, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:29:24,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=741174.0, ans=0.125 2023-06-21 08:29:34,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.613e+02 3.171e+02 3.955e+02 6.837e+02, threshold=6.341e+02, percent-clipped=1.0 2023-06-21 08:30:28,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=741234.0, ans=10.0 2023-06-21 08:30:37,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=741294.0, ans=0.125 2023-06-21 08:31:16,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=741354.0, ans=0.125 2023-06-21 08:32:00,147 INFO [train.py:996] (0/4) Epoch 5, batch 1600, loss[loss=0.3199, simple_loss=0.3939, pruned_loss=0.123, over 21438.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3084, pruned_loss=0.08246, over 4288849.78 frames. 
], batch size: 507, lr: 6.56e-03, grad_scale: 32.0 2023-06-21 08:32:00,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=741474.0, ans=0.2 2023-06-21 08:32:01,135 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5 2023-06-21 08:32:03,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=741474.0, ans=0.2 2023-06-21 08:32:16,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=741474.0, ans=0.0 2023-06-21 08:33:58,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=741714.0, ans=0.0 2023-06-21 08:34:18,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=741714.0, ans=0.125 2023-06-21 08:34:26,755 INFO [train.py:996] (0/4) Epoch 5, batch 1650, loss[loss=0.2248, simple_loss=0.2928, pruned_loss=0.07839, over 21719.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3098, pruned_loss=0.08275, over 4280230.25 frames. ], batch size: 263, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:34:39,412 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.504e+02 2.926e+02 3.545e+02 5.904e+02, threshold=5.852e+02, percent-clipped=0.0 2023-06-21 08:34:50,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-21 08:34:57,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=741834.0, ans=0.0 2023-06-21 08:35:28,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=741894.0, ans=0.125 2023-06-21 08:35:37,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=741894.0, ans=0.125 2023-06-21 08:37:08,121 INFO [train.py:996] (0/4) Epoch 5, batch 1700, loss[loss=0.2283, simple_loss=0.302, pruned_loss=0.07734, over 21665.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3127, pruned_loss=0.08294, over 4285663.88 frames. ], batch size: 263, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:37:10,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=742074.0, ans=0.0 2023-06-21 08:37:28,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=742134.0, ans=0.1 2023-06-21 08:38:15,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-21 08:38:18,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0 2023-06-21 08:38:32,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. 
limit=12.0 2023-06-21 08:39:18,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=742314.0, ans=0.0 2023-06-21 08:39:33,371 INFO [train.py:996] (0/4) Epoch 5, batch 1750, loss[loss=0.2729, simple_loss=0.3568, pruned_loss=0.09452, over 21472.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3127, pruned_loss=0.08195, over 4277799.59 frames. ], batch size: 471, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:39:51,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.601e+02 3.019e+02 3.659e+02 6.555e+02, threshold=6.038e+02, percent-clipped=1.0 2023-06-21 08:40:36,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=742494.0, ans=0.95 2023-06-21 08:41:58,465 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=15.0 2023-06-21 08:42:27,574 INFO [train.py:996] (0/4) Epoch 5, batch 1800, loss[loss=0.1664, simple_loss=0.224, pruned_loss=0.05439, over 21834.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3086, pruned_loss=0.07892, over 4272760.36 frames. ], batch size: 118, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:42:49,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=742734.0, ans=0.05 2023-06-21 08:44:16,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=742854.0, ans=0.0 2023-06-21 08:44:52,832 INFO [train.py:996] (0/4) Epoch 5, batch 1850, loss[loss=0.2263, simple_loss=0.3105, pruned_loss=0.07102, over 21811.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.311, pruned_loss=0.07677, over 4276669.91 frames. ], batch size: 282, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:45:21,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.680e+02 2.344e+02 2.714e+02 3.170e+02 5.790e+02, threshold=5.429e+02, percent-clipped=0.0 2023-06-21 08:45:24,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0 2023-06-21 08:45:51,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0 2023-06-21 08:46:03,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-21 08:46:05,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-21 08:46:47,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=743154.0, ans=0.125 2023-06-21 08:47:39,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=743274.0, ans=0.2 2023-06-21 08:47:40,692 INFO [train.py:996] (0/4) Epoch 5, batch 1900, loss[loss=0.2244, simple_loss=0.3072, pruned_loss=0.07078, over 21806.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3102, pruned_loss=0.07644, over 4280445.89 frames. 
], batch size: 351, lr: 6.56e-03, grad_scale: 16.0 2023-06-21 08:47:44,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=743274.0, ans=0.125 2023-06-21 08:47:56,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=743334.0, ans=0.125 2023-06-21 08:48:38,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=743394.0, ans=0.125 2023-06-21 08:48:39,582 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.59 vs. limit=15.0 2023-06-21 08:49:01,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=743454.0, ans=0.125 2023-06-21 08:49:22,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=743454.0, ans=0.1 2023-06-21 08:49:37,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.66 vs. limit=12.0 2023-06-21 08:49:48,612 INFO [train.py:996] (0/4) Epoch 5, batch 1950, loss[loss=0.2455, simple_loss=0.3444, pruned_loss=0.07329, over 19775.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3077, pruned_loss=0.07742, over 4284495.36 frames. ], batch size: 703, lr: 6.55e-03, grad_scale: 16.0 2023-06-21 08:50:13,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-21 08:50:13,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.637e+02 3.082e+02 3.738e+02 5.890e+02, threshold=6.165e+02, percent-clipped=2.0 2023-06-21 08:50:19,710 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-21 08:50:43,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=743634.0, ans=0.05 2023-06-21 08:52:28,569 INFO [train.py:996] (0/4) Epoch 5, batch 2000, loss[loss=0.248, simple_loss=0.3423, pruned_loss=0.07685, over 21841.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3028, pruned_loss=0.07681, over 4274437.94 frames. ], batch size: 372, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 08:52:51,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=743874.0, ans=0.2 2023-06-21 08:53:15,406 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-124000.pt 2023-06-21 08:54:20,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=744054.0, ans=0.05 2023-06-21 08:54:21,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=744054.0, ans=0.125 2023-06-21 08:54:44,410 INFO [train.py:996] (0/4) Epoch 5, batch 2050, loss[loss=0.2269, simple_loss=0.2883, pruned_loss=0.08279, over 21510.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.304, pruned_loss=0.07658, over 4273528.63 frames. 
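
Two checkpoint flavours appear in this log: per-epoch files written at the epoch boundary (zipformer/exp_L_small/epoch-4.pt earlier in this section) and batch-indexed files written every fixed number of training batches (checkpoint-124000.pt here), the latter typically pruned to a trailing window. A hedged sketch of the batch-indexed variant; the `save_every_n`/`keep_last_k` pruning logic is an assumption:

```python
from pathlib import Path
import torch

def save_batch_checkpoint(model, exp_dir: Path, batch_idx: int,
                          save_every_n: int = 4000, keep_last_k: int = 30):
    if batch_idx == 0 or batch_idx % save_every_n != 0:
        return
    torch.save({"model": model.state_dict(), "batch_idx_train": batch_idx},
               exp_dir / f"checkpoint-{batch_idx}.pt")
    # Keep only the newest keep_last_k batch-indexed checkpoints.
    ckpts = sorted(exp_dir.glob("checkpoint-*.pt"),
                   key=lambda p: int(p.stem.split("-")[1]))
    for old in ckpts[:-keep_last_k]:
        old.unlink()
```
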
], batch size: 389, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 08:55:04,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.606e+02 2.968e+02 3.488e+02 6.810e+02, threshold=5.937e+02, percent-clipped=2.0 2023-06-21 08:55:49,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-21 08:56:11,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=744294.0, ans=0.125 2023-06-21 08:57:04,038 INFO [train.py:996] (0/4) Epoch 5, batch 2100, loss[loss=0.2457, simple_loss=0.3169, pruned_loss=0.0872, over 21690.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3071, pruned_loss=0.07842, over 4279473.43 frames. ], batch size: 112, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 08:57:15,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=744474.0, ans=0.2 2023-06-21 08:57:30,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=744534.0, ans=0.125 2023-06-21 08:58:00,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.01 vs. limit=10.0 2023-06-21 08:58:17,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=744594.0, ans=0.125 2023-06-21 08:58:42,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=27.94 vs. limit=22.5 2023-06-21 08:58:47,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-21 08:59:35,518 INFO [train.py:996] (0/4) Epoch 5, batch 2150, loss[loss=0.2224, simple_loss=0.3247, pruned_loss=0.06001, over 21200.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3077, pruned_loss=0.07908, over 4278782.00 frames. ], batch size: 548, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 09:00:06,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.554e+02 3.162e+02 3.775e+02 6.322e+02, threshold=6.325e+02, percent-clipped=1.0 2023-06-21 09:00:42,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=744894.0, ans=0.125 2023-06-21 09:01:09,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=744954.0, ans=0.0 2023-06-21 09:02:00,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-21 09:02:07,706 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=22.5 2023-06-21 09:02:08,197 INFO [train.py:996] (0/4) Epoch 5, batch 2200, loss[loss=0.2165, simple_loss=0.281, pruned_loss=0.07607, over 16868.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3119, pruned_loss=0.07975, over 4274351.12 frames. ], batch size: 64, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 09:02:09,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.65 vs. 
limit=8.0 2023-06-21 09:02:57,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=745134.0, ans=0.125 2023-06-21 09:04:05,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=745254.0, ans=0.125 2023-06-21 09:04:12,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=745314.0, ans=0.0 2023-06-21 09:04:39,507 INFO [train.py:996] (0/4) Epoch 5, batch 2250, loss[loss=0.2448, simple_loss=0.3125, pruned_loss=0.08852, over 21542.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3089, pruned_loss=0.07841, over 4273604.26 frames. ], batch size: 441, lr: 6.55e-03, grad_scale: 32.0 2023-06-21 09:04:48,406 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.467e+02 2.789e+02 3.250e+02 6.214e+02, threshold=5.578e+02, percent-clipped=0.0 2023-06-21 09:05:02,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=745374.0, ans=0.125 2023-06-21 09:05:43,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=745494.0, ans=0.025 2023-06-21 09:06:06,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=745554.0, ans=0.125 2023-06-21 09:06:10,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=745554.0, ans=0.125 2023-06-21 09:06:20,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-21 09:06:33,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=745614.0, ans=0.0 2023-06-21 09:06:49,389 INFO [train.py:996] (0/4) Epoch 5, batch 2300, loss[loss=0.2047, simple_loss=0.2672, pruned_loss=0.07113, over 21385.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3025, pruned_loss=0.07803, over 4271947.09 frames. ], batch size: 211, lr: 6.54e-03, grad_scale: 32.0 2023-06-21 09:07:36,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=745734.0, ans=0.125 2023-06-21 09:08:03,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=745794.0, ans=0.125 2023-06-21 09:08:13,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=745854.0, ans=0.2 2023-06-21 09:08:25,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=745854.0, ans=0.2 2023-06-21 09:09:04,699 INFO [train.py:996] (0/4) Epoch 5, batch 2350, loss[loss=0.2232, simple_loss=0.282, pruned_loss=0.0822, over 21213.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2991, pruned_loss=0.0782, over 4269246.60 frames. ], batch size: 159, lr: 6.54e-03, grad_scale: 32.0 2023-06-21 09:09:18,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.648e+02 3.067e+02 3.920e+02 6.116e+02, threshold=6.134e+02, percent-clipped=2.0 2023-06-21 09:10:18,706 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.32 vs. 
limit=22.5 2023-06-21 09:10:57,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-21 09:11:13,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=746214.0, ans=0.07 2023-06-21 09:11:16,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=746214.0, ans=0.125 2023-06-21 09:11:18,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-21 09:11:46,925 INFO [train.py:996] (0/4) Epoch 5, batch 2400, loss[loss=0.2618, simple_loss=0.3254, pruned_loss=0.09912, over 21809.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3008, pruned_loss=0.07976, over 4262182.24 frames. ], batch size: 282, lr: 6.54e-03, grad_scale: 32.0 2023-06-21 09:11:56,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.87 vs. limit=10.0 2023-06-21 09:11:56,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0 2023-06-21 09:12:33,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=746334.0, ans=0.125 2023-06-21 09:12:44,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=746394.0, ans=0.0 2023-06-21 09:13:30,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=22.5 2023-06-21 09:14:12,800 INFO [train.py:996] (0/4) Epoch 5, batch 2450, loss[loss=0.2045, simple_loss=0.2763, pruned_loss=0.06632, over 21600.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3087, pruned_loss=0.08239, over 4250834.43 frames. ], batch size: 247, lr: 6.54e-03, grad_scale: 32.0 2023-06-21 09:14:38,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.774e+02 3.114e+02 3.672e+02 6.323e+02, threshold=6.229e+02, percent-clipped=1.0 2023-06-21 09:14:44,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-21 09:16:05,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=746814.0, ans=0.2 2023-06-21 09:16:23,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=746874.0, ans=0.0 2023-06-21 09:16:24,621 INFO [train.py:996] (0/4) Epoch 5, batch 2500, loss[loss=0.2382, simple_loss=0.3125, pruned_loss=0.08196, over 21994.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.307, pruned_loss=0.08275, over 4255271.95 frames. ], batch size: 103, lr: 6.54e-03, grad_scale: 16.0 2023-06-21 09:16:32,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=746874.0, ans=0.2 2023-06-21 09:17:43,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.12 vs. 
limit=15.0 2023-06-21 09:18:42,277 INFO [train.py:996] (0/4) Epoch 5, batch 2550, loss[loss=0.2062, simple_loss=0.2826, pruned_loss=0.06493, over 21721.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3048, pruned_loss=0.0822, over 4265662.82 frames. ], batch size: 282, lr: 6.54e-03, grad_scale: 16.0 2023-06-21 09:18:45,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=747174.0, ans=0.0 2023-06-21 09:18:48,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=747174.0, ans=0.125 2023-06-21 09:18:50,716 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.510e+02 2.866e+02 3.285e+02 4.415e+02, threshold=5.731e+02, percent-clipped=0.0 2023-06-21 09:18:54,734 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-21 09:19:11,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=747234.0, ans=0.2 2023-06-21 09:19:14,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=747234.0, ans=0.125 2023-06-21 09:19:46,823 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-21 09:19:54,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=747294.0, ans=0.04949747468305833 2023-06-21 09:20:59,460 INFO [train.py:996] (0/4) Epoch 5, batch 2600, loss[loss=0.2647, simple_loss=0.3296, pruned_loss=0.09984, over 21737.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3065, pruned_loss=0.0823, over 4262119.37 frames. ], batch size: 247, lr: 6.54e-03, grad_scale: 16.0 2023-06-21 09:21:14,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=747474.0, ans=0.0 2023-06-21 09:22:06,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=747594.0, ans=0.1 2023-06-21 09:22:16,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=747594.0, ans=0.125 2023-06-21 09:22:57,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=22.5 2023-06-21 09:23:19,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=747714.0, ans=0.0 2023-06-21 09:23:19,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=747714.0, ans=0.0 2023-06-21 09:23:25,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=747774.0, ans=0.125 2023-06-21 09:23:26,448 INFO [train.py:996] (0/4) Epoch 5, batch 2650, loss[loss=0.2441, simple_loss=0.3177, pruned_loss=0.08523, over 21452.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.307, pruned_loss=0.08316, over 4264510.83 frames. 
], batch size: 131, lr: 6.54e-03, grad_scale: 16.0 2023-06-21 09:23:31,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=12.0 2023-06-21 09:23:35,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.828e+02 3.188e+02 4.094e+02 7.867e+02, threshold=6.375e+02, percent-clipped=3.0 2023-06-21 09:24:00,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=747834.0, ans=0.1 2023-06-21 09:24:06,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=747834.0, ans=0.1 2023-06-21 09:25:27,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748014.0, ans=0.1 2023-06-21 09:25:27,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=748014.0, ans=0.1 2023-06-21 09:25:29,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-21 09:25:52,130 INFO [train.py:996] (0/4) Epoch 5, batch 2700, loss[loss=0.2521, simple_loss=0.331, pruned_loss=0.08657, over 21576.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3061, pruned_loss=0.08265, over 4264685.60 frames. ], batch size: 473, lr: 6.53e-03, grad_scale: 16.0 2023-06-21 09:26:15,251 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=12.0 2023-06-21 09:26:54,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=748194.0, ans=0.125 2023-06-21 09:27:19,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=748254.0, ans=0.0 2023-06-21 09:28:14,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=748314.0, ans=0.0 2023-06-21 09:28:18,191 INFO [train.py:996] (0/4) Epoch 5, batch 2750, loss[loss=0.2994, simple_loss=0.3364, pruned_loss=0.1312, over 21792.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3042, pruned_loss=0.08253, over 4271522.83 frames. ], batch size: 508, lr: 6.53e-03, grad_scale: 16.0 2023-06-21 09:28:26,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=748374.0, ans=0.0 2023-06-21 09:28:33,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.495e+02 2.844e+02 3.275e+02 5.915e+02, threshold=5.688e+02, percent-clipped=0.0 2023-06-21 09:28:58,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748434.0, ans=0.1 2023-06-21 09:29:17,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.95 vs. limit=22.5 2023-06-21 09:29:23,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=748494.0, ans=0.125 2023-06-21 09:30:03,037 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. 
limit=15.0 2023-06-21 09:30:03,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=748554.0, ans=0.2 2023-06-21 09:31:01,731 INFO [train.py:996] (0/4) Epoch 5, batch 2800, loss[loss=0.2261, simple_loss=0.2856, pruned_loss=0.08334, over 21315.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3098, pruned_loss=0.08386, over 4273115.84 frames. ], batch size: 131, lr: 6.53e-03, grad_scale: 32.0 2023-06-21 09:31:13,631 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:31:34,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=748734.0, ans=0.125 2023-06-21 09:32:19,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=748794.0, ans=0.0 2023-06-21 09:32:21,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=748794.0, ans=0.125 2023-06-21 09:33:40,820 INFO [train.py:996] (0/4) Epoch 5, batch 2850, loss[loss=0.181, simple_loss=0.2393, pruned_loss=0.06136, over 21430.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3121, pruned_loss=0.08378, over 4266693.04 frames. ], batch size: 194, lr: 6.53e-03, grad_scale: 32.0 2023-06-21 09:33:51,724 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.47 vs. limit=22.5 2023-06-21 09:33:58,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=748974.0, ans=10.0 2023-06-21 09:34:00,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.971e+02 3.664e+02 4.232e+02 8.442e+02, threshold=7.329e+02, percent-clipped=6.0 2023-06-21 09:34:02,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=748974.0, ans=0.125 2023-06-21 09:35:16,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=749154.0, ans=0.125 2023-06-21 09:35:19,656 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:36:12,181 INFO [train.py:996] (0/4) Epoch 5, batch 2900, loss[loss=0.1915, simple_loss=0.2424, pruned_loss=0.07034, over 20732.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3077, pruned_loss=0.08213, over 4261366.24 frames. ], batch size: 608, lr: 6.53e-03, grad_scale: 32.0 2023-06-21 09:36:31,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=749334.0, ans=0.125 2023-06-21 09:36:43,133 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-21 09:37:32,507 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:37:40,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=749454.0, ans=0.0 2023-06-21 09:38:01,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.02 vs. 
limit=22.5 2023-06-21 09:38:03,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=749454.0, ans=0.125 2023-06-21 09:38:34,558 INFO [train.py:996] (0/4) Epoch 5, batch 2950, loss[loss=0.1678, simple_loss=0.2304, pruned_loss=0.05259, over 21905.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.309, pruned_loss=0.08286, over 4270782.08 frames. ], batch size: 98, lr: 6.53e-03, grad_scale: 32.0 2023-06-21 09:38:48,673 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.603e+02 2.918e+02 3.396e+02 5.731e+02, threshold=5.836e+02, percent-clipped=0.0 2023-06-21 09:39:39,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=749694.0, ans=0.0 2023-06-21 09:39:39,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=749694.0, ans=0.125 2023-06-21 09:39:40,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=749694.0, ans=0.0 2023-06-21 09:40:03,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.29 vs. limit=10.0 2023-06-21 09:40:23,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=749754.0, ans=0.1 2023-06-21 09:40:56,767 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:40:58,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=749814.0, ans=0.125 2023-06-21 09:41:07,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=749814.0, ans=0.125 2023-06-21 09:41:10,936 INFO [train.py:996] (0/4) Epoch 5, batch 3000, loss[loss=0.2081, simple_loss=0.2755, pruned_loss=0.07036, over 21628.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3118, pruned_loss=0.08313, over 4277624.28 frames. ], batch size: 263, lr: 6.53e-03, grad_scale: 32.0 2023-06-21 09:41:10,937 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 09:42:09,550 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2543, simple_loss=0.346, pruned_loss=0.08133, over 1796401.00 frames. 2023-06-21 09:42:09,552 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB 2023-06-21 09:42:20,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=749874.0, ans=0.0 2023-06-21 09:42:52,068 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=8.0 2023-06-21 09:43:19,226 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-21 09:44:23,315 INFO [train.py:996] (0/4) Epoch 5, batch 3050, loss[loss=0.2705, simple_loss=0.3407, pruned_loss=0.1002, over 21543.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3119, pruned_loss=0.08174, over 4272219.95 frames. ], batch size: 508, lr: 6.52e-03, grad_scale: 32.0 2023-06-21 09:44:39,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. 
limit=10.0 2023-06-21 09:44:43,389 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.527e+02 2.843e+02 3.371e+02 5.319e+02, threshold=5.686e+02, percent-clipped=0.0 2023-06-21 09:44:58,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=750234.0, ans=0.0 2023-06-21 09:45:07,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=750234.0, ans=0.0 2023-06-21 09:46:48,370 INFO [train.py:996] (0/4) Epoch 5, batch 3100, loss[loss=0.2064, simple_loss=0.2938, pruned_loss=0.05948, over 21577.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3119, pruned_loss=0.08066, over 4273824.44 frames. ], batch size: 230, lr: 6.52e-03, grad_scale: 32.0 2023-06-21 09:47:40,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=16.07 vs. limit=15.0 2023-06-21 09:48:09,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=750594.0, ans=0.1 2023-06-21 09:48:55,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=750714.0, ans=0.0 2023-06-21 09:48:58,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=750714.0, ans=0.125 2023-06-21 09:49:02,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=750714.0, ans=0.0 2023-06-21 09:49:08,174 INFO [train.py:996] (0/4) Epoch 5, batch 3150, loss[loss=0.2728, simple_loss=0.3438, pruned_loss=0.1009, over 21571.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3137, pruned_loss=0.08169, over 4273326.52 frames. ], batch size: 389, lr: 6.52e-03, grad_scale: 32.0 2023-06-21 09:49:26,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.529e+02 2.952e+02 3.587e+02 6.103e+02, threshold=5.905e+02, percent-clipped=1.0 2023-06-21 09:50:04,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=750834.0, ans=0.2 2023-06-21 09:50:20,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=750894.0, ans=0.2 2023-06-21 09:51:58,574 INFO [train.py:996] (0/4) Epoch 5, batch 3200, loss[loss=0.2819, simple_loss=0.3594, pruned_loss=0.1022, over 21429.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3145, pruned_loss=0.08131, over 4277007.69 frames. ], batch size: 507, lr: 6.52e-03, grad_scale: 32.0 2023-06-21 09:52:14,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=751074.0, ans=0.125 2023-06-21 09:52:27,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=751134.0, ans=0.1 2023-06-21 09:52:48,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=751194.0, ans=0.05 2023-06-21 09:53:45,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.81 vs. 
limit=15.0 2023-06-21 09:53:48,000 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:54:09,984 INFO [train.py:996] (0/4) Epoch 5, batch 3250, loss[loss=0.2257, simple_loss=0.2951, pruned_loss=0.07812, over 21855.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3159, pruned_loss=0.08337, over 4276703.94 frames. ], batch size: 98, lr: 6.52e-03, grad_scale: 16.0 2023-06-21 09:54:31,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-21 09:54:34,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.744e+02 3.235e+02 3.683e+02 5.247e+02, threshold=6.470e+02, percent-clipped=0.0 2023-06-21 09:54:50,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=751434.0, ans=0.0 2023-06-21 09:55:38,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=751494.0, ans=0.125 2023-06-21 09:56:16,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.36 vs. limit=22.5 2023-06-21 09:56:21,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.35 vs. limit=15.0 2023-06-21 09:56:36,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=751674.0, ans=0.0 2023-06-21 09:56:54,122 INFO [train.py:996] (0/4) Epoch 5, batch 3300, loss[loss=0.2268, simple_loss=0.2908, pruned_loss=0.08144, over 15188.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3118, pruned_loss=0.08313, over 4274059.30 frames. ], batch size: 61, lr: 6.52e-03, grad_scale: 16.0 2023-06-21 09:57:04,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=751674.0, ans=0.02 2023-06-21 09:57:09,024 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-21 09:57:12,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=751674.0, ans=15.0 2023-06-21 09:57:18,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=751734.0, ans=0.125 2023-06-21 09:58:14,820 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-21 09:59:05,171 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:59:09,223 INFO [train.py:996] (0/4) Epoch 5, batch 3350, loss[loss=0.2305, simple_loss=0.2804, pruned_loss=0.09029, over 20058.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3143, pruned_loss=0.08354, over 4276969.51 frames. 
], batch size: 704, lr: 6.52e-03, grad_scale: 16.0 2023-06-21 09:59:38,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.776e+02 3.180e+02 3.722e+02 8.013e+02, threshold=6.359e+02, percent-clipped=4.0 2023-06-21 10:01:43,803 INFO [train.py:996] (0/4) Epoch 5, batch 3400, loss[loss=0.2984, simple_loss=0.4024, pruned_loss=0.09714, over 20758.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3159, pruned_loss=0.08458, over 4275430.53 frames. ], batch size: 607, lr: 6.52e-03, grad_scale: 16.0 2023-06-21 10:02:00,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=752274.0, ans=0.1 2023-06-21 10:03:17,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=752454.0, ans=0.2 2023-06-21 10:03:52,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=752514.0, ans=0.125 2023-06-21 10:04:00,533 INFO [train.py:996] (0/4) Epoch 5, batch 3450, loss[loss=0.2279, simple_loss=0.2901, pruned_loss=0.08283, over 21503.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3113, pruned_loss=0.08355, over 4276389.32 frames. ], batch size: 230, lr: 6.51e-03, grad_scale: 16.0 2023-06-21 10:04:11,206 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.675e+02 2.919e+02 3.509e+02 4.747e+02, threshold=5.839e+02, percent-clipped=0.0 2023-06-21 10:06:18,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=752814.0, ans=0.125 2023-06-21 10:06:26,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=752874.0, ans=0.1 2023-06-21 10:06:27,098 INFO [train.py:996] (0/4) Epoch 5, batch 3500, loss[loss=0.2904, simple_loss=0.3581, pruned_loss=0.1113, over 21707.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3199, pruned_loss=0.08745, over 4281932.51 frames. ], batch size: 351, lr: 6.51e-03, grad_scale: 16.0 2023-06-21 10:06:58,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=752934.0, ans=0.1 2023-06-21 10:06:58,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=22.5 2023-06-21 10:08:29,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=753114.0, ans=0.1 2023-06-21 10:08:50,571 INFO [train.py:996] (0/4) Epoch 5, batch 3550, loss[loss=0.2098, simple_loss=0.2759, pruned_loss=0.07182, over 21616.00 frames. ], tot_loss[loss=0.251, simple_loss=0.323, pruned_loss=0.08955, over 4284372.04 frames. ], batch size: 247, lr: 6.51e-03, grad_scale: 16.0 2023-06-21 10:09:01,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. 
limit=15.0 2023-06-21 10:09:06,197 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.617e+02 3.171e+02 3.907e+02 6.956e+02, threshold=6.342e+02, percent-clipped=4.0 2023-06-21 10:10:10,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=753354.0, ans=0.2 2023-06-21 10:10:35,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=753414.0, ans=0.125 2023-06-21 10:10:52,143 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.55 vs. limit=15.0 2023-06-21 10:10:52,550 INFO [train.py:996] (0/4) Epoch 5, batch 3600, loss[loss=0.2461, simple_loss=0.3107, pruned_loss=0.09078, over 21477.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3184, pruned_loss=0.08917, over 4275434.02 frames. ], batch size: 211, lr: 6.51e-03, grad_scale: 32.0 2023-06-21 10:11:08,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=753534.0, ans=15.0 2023-06-21 10:11:17,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=753534.0, ans=0.0 2023-06-21 10:11:40,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=753594.0, ans=0.125 2023-06-21 10:11:53,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-06-21 10:11:54,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=753654.0, ans=0.1 2023-06-21 10:12:13,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=753654.0, ans=0.125 2023-06-21 10:12:36,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=753654.0, ans=0.2 2023-06-21 10:12:38,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=753714.0, ans=0.125 2023-06-21 10:12:59,210 INFO [train.py:996] (0/4) Epoch 5, batch 3650, loss[loss=0.23, simple_loss=0.3162, pruned_loss=0.07194, over 21665.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3192, pruned_loss=0.08883, over 4280415.04 frames. ], batch size: 389, lr: 6.51e-03, grad_scale: 32.0 2023-06-21 10:13:10,053 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.602e+02 2.931e+02 3.344e+02 6.459e+02, threshold=5.862e+02, percent-clipped=1.0 2023-06-21 10:14:31,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=753954.0, ans=0.1 2023-06-21 10:15:19,399 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. 
limit=12.0 2023-06-21 10:15:21,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=754014.0, ans=0.0 2023-06-21 10:15:29,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=754014.0, ans=0.125 2023-06-21 10:15:32,321 INFO [train.py:996] (0/4) Epoch 5, batch 3700, loss[loss=0.2329, simple_loss=0.2957, pruned_loss=0.08502, over 21818.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3169, pruned_loss=0.08711, over 4275787.27 frames. ], batch size: 102, lr: 6.51e-03, grad_scale: 32.0 2023-06-21 10:16:13,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.51 vs. limit=10.0 2023-06-21 10:16:17,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=754194.0, ans=0.05 2023-06-21 10:16:59,391 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:17:04,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-21 10:17:26,640 INFO [train.py:996] (0/4) Epoch 5, batch 3750, loss[loss=0.1764, simple_loss=0.2518, pruned_loss=0.05053, over 21501.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3142, pruned_loss=0.08648, over 4272136.51 frames. ], batch size: 212, lr: 6.51e-03, grad_scale: 32.0 2023-06-21 10:17:37,726 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.410e+02 2.787e+02 3.137e+02 4.786e+02, threshold=5.574e+02, percent-clipped=0.0 2023-06-21 10:17:38,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=754374.0, ans=0.125 2023-06-21 10:18:49,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=754494.0, ans=0.2 2023-06-21 10:19:10,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=754554.0, ans=0.125 2023-06-21 10:20:02,581 INFO [train.py:996] (0/4) Epoch 5, batch 3800, loss[loss=0.2968, simple_loss=0.3546, pruned_loss=0.1195, over 21488.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3124, pruned_loss=0.0842, over 4273888.29 frames. ], batch size: 509, lr: 6.51e-03, grad_scale: 32.0 2023-06-21 10:20:26,977 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-21 10:20:44,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=754734.0, ans=0.125 2023-06-21 10:21:20,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=754854.0, ans=0.125 2023-06-21 10:21:55,056 INFO [train.py:996] (0/4) Epoch 5, batch 3850, loss[loss=0.2047, simple_loss=0.2669, pruned_loss=0.07118, over 21901.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3102, pruned_loss=0.08428, over 4256325.12 frames. 
], batch size: 107, lr: 6.50e-03, grad_scale: 32.0 2023-06-21 10:22:22,378 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.524e+02 3.055e+02 3.931e+02 8.028e+02, threshold=6.111e+02, percent-clipped=3.0 2023-06-21 10:22:27,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-21 10:22:48,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=755094.0, ans=0.125 2023-06-21 10:23:53,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=755214.0, ans=0.0 2023-06-21 10:24:04,950 INFO [train.py:996] (0/4) Epoch 5, batch 3900, loss[loss=0.2248, simple_loss=0.2884, pruned_loss=0.08059, over 21828.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3054, pruned_loss=0.08384, over 4263948.08 frames. ], batch size: 332, lr: 6.50e-03, grad_scale: 32.0 2023-06-21 10:25:30,793 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:25:38,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=755454.0, ans=0.125 2023-06-21 10:26:27,955 INFO [train.py:996] (0/4) Epoch 5, batch 3950, loss[loss=0.2406, simple_loss=0.3275, pruned_loss=0.07687, over 21517.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3077, pruned_loss=0.083, over 4271563.98 frames. ], batch size: 471, lr: 6.50e-03, grad_scale: 32.0 2023-06-21 10:26:33,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=755574.0, ans=0.2 2023-06-21 10:26:46,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.463e+02 2.789e+02 3.515e+02 5.351e+02, threshold=5.577e+02, percent-clipped=0.0 2023-06-21 10:27:30,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=755634.0, ans=0.2 2023-06-21 10:28:21,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=755814.0, ans=0.0 2023-06-21 10:28:48,539 INFO [train.py:996] (0/4) Epoch 5, batch 4000, loss[loss=0.1987, simple_loss=0.2642, pruned_loss=0.06658, over 21638.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3006, pruned_loss=0.07913, over 4273080.34 frames. ], batch size: 282, lr: 6.50e-03, grad_scale: 32.0 2023-06-21 10:28:50,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=755874.0, ans=0.125 2023-06-21 10:29:54,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=755994.0, ans=0.125 2023-06-21 10:30:22,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756054.0, ans=0.1 2023-06-21 10:30:33,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=756054.0, ans=0.125 2023-06-21 10:30:54,375 INFO [train.py:996] (0/4) Epoch 5, batch 4050, loss[loss=0.2898, simple_loss=0.3506, pruned_loss=0.1145, over 21498.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3018, pruned_loss=0.07845, over 4271306.49 frames. 
], batch size: 507, lr: 6.50e-03, grad_scale: 32.0 2023-06-21 10:31:18,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756174.0, ans=0.1 2023-06-21 10:31:20,398 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.698e+02 2.419e+02 2.822e+02 3.375e+02 5.095e+02, threshold=5.643e+02, percent-clipped=0.0 2023-06-21 10:32:03,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=756234.0, ans=0.0 2023-06-21 10:32:04,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=756234.0, ans=0.125 2023-06-21 10:33:22,489 INFO [train.py:996] (0/4) Epoch 5, batch 4100, loss[loss=0.2221, simple_loss=0.3039, pruned_loss=0.07014, over 16817.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3019, pruned_loss=0.07885, over 4269335.45 frames. ], batch size: 60, lr: 6.50e-03, grad_scale: 16.0 2023-06-21 10:34:08,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=756594.0, ans=0.125 2023-06-21 10:34:33,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=756654.0, ans=0.125 2023-06-21 10:34:56,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756714.0, ans=0.1 2023-06-21 10:35:00,540 INFO [train.py:996] (0/4) Epoch 5, batch 4150, loss[loss=0.2029, simple_loss=0.2861, pruned_loss=0.05986, over 21637.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3013, pruned_loss=0.07521, over 4275300.62 frames. ], batch size: 263, lr: 6.50e-03, grad_scale: 16.0 2023-06-21 10:35:12,286 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 2.501e+02 2.940e+02 3.469e+02 5.994e+02, threshold=5.880e+02, percent-clipped=1.0 2023-06-21 10:36:12,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=756954.0, ans=0.1 2023-06-21 10:36:24,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=757014.0, ans=0.125 2023-06-21 10:36:30,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=757014.0, ans=0.125 2023-06-21 10:36:46,358 INFO [train.py:996] (0/4) Epoch 5, batch 4200, loss[loss=0.2608, simple_loss=0.3437, pruned_loss=0.08892, over 21537.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3031, pruned_loss=0.0762, over 4268411.73 frames. ], batch size: 389, lr: 6.50e-03, grad_scale: 16.0 2023-06-21 10:38:02,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=757194.0, ans=0.125 2023-06-21 10:38:30,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=757314.0, ans=0.0 2023-06-21 10:38:58,468 INFO [train.py:996] (0/4) Epoch 5, batch 4250, loss[loss=0.2786, simple_loss=0.3597, pruned_loss=0.09876, over 21738.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3105, pruned_loss=0.07894, over 4266992.30 frames. 
], batch size: 118, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:39:19,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.640e+02 3.180e+02 4.167e+02 9.459e+02, threshold=6.360e+02, percent-clipped=16.0 2023-06-21 10:39:21,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=757434.0, ans=0.125 2023-06-21 10:40:08,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=757494.0, ans=0.125 2023-06-21 10:40:18,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=757554.0, ans=0.125 2023-06-21 10:40:29,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=757554.0, ans=0.0 2023-06-21 10:40:45,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=757614.0, ans=0.0 2023-06-21 10:40:59,218 INFO [train.py:996] (0/4) Epoch 5, batch 4300, loss[loss=0.2082, simple_loss=0.2965, pruned_loss=0.05999, over 21403.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.315, pruned_loss=0.08007, over 4258996.68 frames. ], batch size: 211, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:43:29,614 INFO [train.py:996] (0/4) Epoch 5, batch 4350, loss[loss=0.2009, simple_loss=0.2717, pruned_loss=0.06504, over 21450.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3111, pruned_loss=0.0789, over 4250349.96 frames. ], batch size: 212, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:43:46,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.644e+02 3.174e+02 3.856e+02 7.919e+02, threshold=6.347e+02, percent-clipped=3.0 2023-06-21 10:44:31,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=758094.0, ans=0.04949747468305833 2023-06-21 10:44:48,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=758154.0, ans=0.125 2023-06-21 10:44:54,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=758154.0, ans=0.125 2023-06-21 10:45:25,084 INFO [train.py:996] (0/4) Epoch 5, batch 4400, loss[loss=0.2307, simple_loss=0.3257, pruned_loss=0.06784, over 21790.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3081, pruned_loss=0.07807, over 4251971.38 frames. 
], batch size: 282, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:46:33,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=758394.0, ans=0.0 2023-06-21 10:46:46,935 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:46:56,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=758454.0, ans=0.125 2023-06-21 10:47:33,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=758514.0, ans=0.2 2023-06-21 10:47:44,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=758514.0, ans=0.125 2023-06-21 10:47:50,938 INFO [train.py:996] (0/4) Epoch 5, batch 4450, loss[loss=0.2058, simple_loss=0.2653, pruned_loss=0.07309, over 20756.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3138, pruned_loss=0.07916, over 4253017.01 frames. ], batch size: 609, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:48:07,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=758574.0, ans=0.2 2023-06-21 10:48:08,169 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.525e+02 2.989e+02 3.664e+02 5.986e+02, threshold=5.979e+02, percent-clipped=0.0 2023-06-21 10:48:48,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=758634.0, ans=0.125 2023-06-21 10:50:13,879 INFO [train.py:996] (0/4) Epoch 5, batch 4500, loss[loss=0.2365, simple_loss=0.3136, pruned_loss=0.07968, over 21179.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3162, pruned_loss=0.08129, over 4265288.00 frames. ], batch size: 143, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:50:14,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=758874.0, ans=0.0 2023-06-21 10:50:32,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=758874.0, ans=0.1 2023-06-21 10:50:42,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=758934.0, ans=0.125 2023-06-21 10:50:51,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=758934.0, ans=0.1 2023-06-21 10:51:04,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=758994.0, ans=0.125 2023-06-21 10:51:07,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=758994.0, ans=0.125 2023-06-21 10:51:55,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0 2023-06-21 10:52:36,235 INFO [train.py:996] (0/4) Epoch 5, batch 4550, loss[loss=0.2098, simple_loss=0.2686, pruned_loss=0.0755, over 21184.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3189, pruned_loss=0.08177, over 4268553.31 frames. 
], batch size: 548, lr: 6.49e-03, grad_scale: 32.0 2023-06-21 10:52:55,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=759174.0, ans=0.125 2023-06-21 10:53:00,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.644e+02 2.944e+02 3.521e+02 6.236e+02, threshold=5.889e+02, percent-clipped=2.0 2023-06-21 10:53:05,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-21 10:53:21,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=759234.0, ans=0.125 2023-06-21 10:54:03,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=759354.0, ans=0.2 2023-06-21 10:54:09,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=759354.0, ans=0.125 2023-06-21 10:54:53,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=759474.0, ans=0.0 2023-06-21 10:54:54,674 INFO [train.py:996] (0/4) Epoch 5, batch 4600, loss[loss=0.2526, simple_loss=0.3334, pruned_loss=0.0859, over 21475.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3204, pruned_loss=0.08235, over 4273675.41 frames. ], batch size: 211, lr: 6.49e-03, grad_scale: 16.0 2023-06-21 10:55:36,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=22.5 2023-06-21 10:55:41,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=759594.0, ans=0.1 2023-06-21 10:55:44,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=759594.0, ans=0.035 2023-06-21 10:55:46,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0 2023-06-21 10:56:35,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=759714.0, ans=0.1 2023-06-21 10:57:01,240 INFO [train.py:996] (0/4) Epoch 5, batch 4650, loss[loss=0.1859, simple_loss=0.2554, pruned_loss=0.05818, over 21758.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3154, pruned_loss=0.08154, over 4278567.66 frames. ], batch size: 298, lr: 6.48e-03, grad_scale: 16.0 2023-06-21 10:57:32,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=759774.0, ans=0.1 2023-06-21 10:57:33,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 2.446e+02 2.893e+02 3.577e+02 6.132e+02, threshold=5.786e+02, percent-clipped=2.0 2023-06-21 10:58:26,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=759954.0, ans=0.0 2023-06-21 10:58:28,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=759954.0, ans=0.1 2023-06-21 10:59:22,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. 
limit=15.0 2023-06-21 10:59:26,247 INFO [train.py:996] (0/4) Epoch 5, batch 4700, loss[loss=0.2013, simple_loss=0.2704, pruned_loss=0.06615, over 21401.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3067, pruned_loss=0.07965, over 4274878.46 frames. ], batch size: 131, lr: 6.48e-03, grad_scale: 16.0 2023-06-21 10:59:48,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=760134.0, ans=0.0 2023-06-21 11:01:18,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=760314.0, ans=0.2 2023-06-21 11:01:42,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=760374.0, ans=0.0 2023-06-21 11:01:43,755 INFO [train.py:996] (0/4) Epoch 5, batch 4750, loss[loss=0.2139, simple_loss=0.2731, pruned_loss=0.07736, over 21594.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.303, pruned_loss=0.07964, over 4269574.24 frames. ], batch size: 231, lr: 6.48e-03, grad_scale: 16.0 2023-06-21 11:02:03,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=760374.0, ans=0.125 2023-06-21 11:02:04,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.535e+02 2.851e+02 3.318e+02 5.705e+02, threshold=5.702e+02, percent-clipped=0.0 2023-06-21 11:02:12,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=760434.0, ans=0.0 2023-06-21 11:02:45,765 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=8.0 2023-06-21 11:02:49,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-21 11:03:10,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=760554.0, ans=0.1 2023-06-21 11:03:11,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.70 vs. limit=22.5 2023-06-21 11:03:42,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=760614.0, ans=0.125 2023-06-21 11:03:54,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=760674.0, ans=0.0 2023-06-21 11:03:55,408 INFO [train.py:996] (0/4) Epoch 5, batch 4800, loss[loss=0.2359, simple_loss=0.3042, pruned_loss=0.08384, over 21920.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3036, pruned_loss=0.07995, over 4278133.54 frames. ], batch size: 316, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:04:27,631 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:05:33,079 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-21 11:05:59,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=760914.0, ans=0.125 2023-06-21 11:06:05,295 INFO [train.py:996] (0/4) Epoch 5, batch 4850, loss[loss=0.2371, simple_loss=0.3129, pruned_loss=0.08067, over 21839.00 frames. 
], tot_loss[loss=0.2324, simple_loss=0.3042, pruned_loss=0.08029, over 4277388.31 frames. ], batch size: 332, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:06:26,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.507e+02 2.882e+02 3.548e+02 6.033e+02, threshold=5.763e+02, percent-clipped=2.0 2023-06-21 11:06:42,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=761034.0, ans=0.125 2023-06-21 11:07:14,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=761154.0, ans=0.0 2023-06-21 11:08:19,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=761214.0, ans=0.0 2023-06-21 11:08:30,024 INFO [train.py:996] (0/4) Epoch 5, batch 4900, loss[loss=0.2701, simple_loss=0.3369, pruned_loss=0.1016, over 21270.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3062, pruned_loss=0.08049, over 4279956.62 frames. ], batch size: 143, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:08:31,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-21 11:08:39,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=761274.0, ans=0.0 2023-06-21 11:08:55,640 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-06-21 11:09:27,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=761394.0, ans=0.09899494936611666 2023-06-21 11:09:55,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=761454.0, ans=0.1 2023-06-21 11:10:38,844 INFO [train.py:996] (0/4) Epoch 5, batch 4950, loss[loss=0.1917, simple_loss=0.2855, pruned_loss=0.049, over 21737.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3089, pruned_loss=0.07869, over 4279248.59 frames. ], batch size: 298, lr: 6.48e-03, grad_scale: 32.0 2023-06-21 11:10:52,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.398e+02 2.770e+02 3.056e+02 4.888e+02, threshold=5.540e+02, percent-clipped=0.0 2023-06-21 11:11:39,397 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-21 11:11:59,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=761754.0, ans=0.2 2023-06-21 11:12:03,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=761754.0, ans=0.0 2023-06-21 11:12:05,299 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:12:27,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=761814.0, ans=0.0 2023-06-21 11:12:30,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=761814.0, ans=0.2 2023-06-21 11:12:40,749 INFO [train.py:996] (0/4) Epoch 5, batch 5000, loss[loss=0.2178, simple_loss=0.297, pruned_loss=0.0693, over 21517.00 frames. 
], tot_loss[loss=0.2299, simple_loss=0.3071, pruned_loss=0.07638, over 4283207.58 frames. ], batch size: 194, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:13:54,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=762054.0, ans=0.07 2023-06-21 11:14:13,491 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-06-21 11:14:46,857 INFO [train.py:996] (0/4) Epoch 5, batch 5050, loss[loss=0.2291, simple_loss=0.3, pruned_loss=0.07908, over 21687.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3076, pruned_loss=0.07783, over 4292099.76 frames. ], batch size: 230, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:15:06,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.567e+02 3.027e+02 3.438e+02 5.567e+02, threshold=6.054e+02, percent-clipped=1.0 2023-06-21 11:15:26,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-21 11:15:34,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=762294.0, ans=0.2 2023-06-21 11:16:58,409 INFO [train.py:996] (0/4) Epoch 5, batch 5100, loss[loss=0.2315, simple_loss=0.3032, pruned_loss=0.07988, over 21300.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3068, pruned_loss=0.07851, over 4292854.22 frames. ], batch size: 176, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:17:20,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=762534.0, ans=0.0 2023-06-21 11:18:34,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-21 11:19:19,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.56 vs. limit=15.0 2023-06-21 11:19:20,184 INFO [train.py:996] (0/4) Epoch 5, batch 5150, loss[loss=0.2133, simple_loss=0.2804, pruned_loss=0.07316, over 21607.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.305, pruned_loss=0.0793, over 4297018.05 frames. ], batch size: 263, lr: 6.47e-03, grad_scale: 16.0 2023-06-21 11:19:36,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.594e+02 2.911e+02 3.354e+02 4.463e+02, threshold=5.822e+02, percent-clipped=0.0 2023-06-21 11:19:49,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=762834.0, ans=0.125 2023-06-21 11:20:19,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=762894.0, ans=0.125 2023-06-21 11:21:03,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.40 vs. limit=22.5 2023-06-21 11:21:39,099 INFO [train.py:996] (0/4) Epoch 5, batch 5200, loss[loss=0.2301, simple_loss=0.3122, pruned_loss=0.07401, over 21275.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3062, pruned_loss=0.08002, over 4288132.54 frames. 
], batch size: 159, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:22:32,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=763134.0, ans=0.125 2023-06-21 11:23:23,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-21 11:23:45,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-21 11:24:02,728 INFO [train.py:996] (0/4) Epoch 5, batch 5250, loss[loss=0.2309, simple_loss=0.3125, pruned_loss=0.0746, over 21734.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3087, pruned_loss=0.07808, over 4287986.38 frames. ], batch size: 298, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:24:23,987 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 2.681e+02 2.958e+02 3.448e+02 5.597e+02, threshold=5.917e+02, percent-clipped=0.0 2023-06-21 11:24:38,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=763494.0, ans=0.125 2023-06-21 11:24:51,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=763494.0, ans=0.125 2023-06-21 11:26:08,820 INFO [train.py:996] (0/4) Epoch 5, batch 5300, loss[loss=0.2319, simple_loss=0.3116, pruned_loss=0.0761, over 21900.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3078, pruned_loss=0.07783, over 4291621.81 frames. ], batch size: 333, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:26:39,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.62 vs. limit=15.0 2023-06-21 11:26:41,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0 2023-06-21 11:26:59,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=763794.0, ans=0.025 2023-06-21 11:27:33,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-21 11:28:24,468 INFO [train.py:996] (0/4) Epoch 5, batch 5350, loss[loss=0.2383, simple_loss=0.2972, pruned_loss=0.08966, over 21298.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.306, pruned_loss=0.07924, over 4297289.61 frames. 
], batch size: 159, lr: 6.47e-03, grad_scale: 32.0 2023-06-21 11:28:24,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=763974.0, ans=0.125 2023-06-21 11:28:39,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=763974.0, ans=12.0 2023-06-21 11:28:44,829 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.381e+02 2.626e+02 3.022e+02 5.115e+02, threshold=5.252e+02, percent-clipped=0.0 2023-06-21 11:28:49,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=764034.0, ans=0.125 2023-06-21 11:28:53,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=764034.0, ans=0.125 2023-06-21 11:28:56,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=764034.0, ans=0.125 2023-06-21 11:29:04,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=764094.0, ans=0.05 2023-06-21 11:30:02,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=764154.0, ans=0.0 2023-06-21 11:30:36,293 INFO [train.py:996] (0/4) Epoch 5, batch 5400, loss[loss=0.2554, simple_loss=0.3238, pruned_loss=0.09347, over 21545.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3076, pruned_loss=0.08111, over 4288593.45 frames. ], batch size: 471, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:31:31,122 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:31:37,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=764394.0, ans=0.1 2023-06-21 11:33:00,994 INFO [train.py:996] (0/4) Epoch 5, batch 5450, loss[loss=0.2613, simple_loss=0.3433, pruned_loss=0.08961, over 19917.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3089, pruned_loss=0.07905, over 4290067.36 frames. ], batch size: 702, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:33:09,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=764574.0, ans=0.0 2023-06-21 11:33:15,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=764634.0, ans=0.125 2023-06-21 11:33:18,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.472e+02 2.911e+02 3.692e+02 6.272e+02, threshold=5.821e+02, percent-clipped=3.0 2023-06-21 11:33:20,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=764634.0, ans=0.0 2023-06-21 11:34:30,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=764754.0, ans=0.125 2023-06-21 11:34:39,692 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. 
limit=15.0 2023-06-21 11:34:43,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=764754.0, ans=0.1 2023-06-21 11:35:04,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=764814.0, ans=0.125 2023-06-21 11:35:10,244 INFO [train.py:996] (0/4) Epoch 5, batch 5500, loss[loss=0.2192, simple_loss=0.317, pruned_loss=0.06075, over 21750.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3122, pruned_loss=0.07601, over 4283678.45 frames. ], batch size: 332, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:35:12,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=764874.0, ans=0.125 2023-06-21 11:35:21,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=764874.0, ans=0.125 2023-06-21 11:35:23,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=764874.0, ans=0.1 2023-06-21 11:36:20,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=764994.0, ans=0.125 2023-06-21 11:37:15,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=765114.0, ans=0.125 2023-06-21 11:37:21,027 INFO [train.py:996] (0/4) Epoch 5, batch 5550, loss[loss=0.1843, simple_loss=0.2747, pruned_loss=0.04692, over 21644.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3118, pruned_loss=0.07392, over 4283162.57 frames. ], batch size: 263, lr: 6.46e-03, grad_scale: 16.0 2023-06-21 11:37:23,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=765174.0, ans=0.0 2023-06-21 11:38:09,586 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.186e+02 2.465e+02 2.869e+02 4.676e+02, threshold=4.930e+02, percent-clipped=0.0 2023-06-21 11:38:54,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=765294.0, ans=0.0 2023-06-21 11:39:04,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=765354.0, ans=0.1 2023-06-21 11:39:55,440 INFO [train.py:996] (0/4) Epoch 5, batch 5600, loss[loss=0.2243, simple_loss=0.2837, pruned_loss=0.08244, over 19935.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3082, pruned_loss=0.07101, over 4280896.89 frames. ], batch size: 703, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:42:13,677 INFO [train.py:996] (0/4) Epoch 5, batch 5650, loss[loss=0.2296, simple_loss=0.2971, pruned_loss=0.08107, over 21218.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3119, pruned_loss=0.07359, over 4273441.99 frames. ], batch size: 143, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:42:36,587 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.490e+02 2.930e+02 3.681e+02 6.971e+02, threshold=5.860e+02, percent-clipped=6.0 2023-06-21 11:43:16,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.82 vs. 
limit=12.0 2023-06-21 11:43:21,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=765894.0, ans=0.125 2023-06-21 11:43:34,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=765954.0, ans=0.125 2023-06-21 11:44:08,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=766014.0, ans=0.1 2023-06-21 11:44:28,801 INFO [train.py:996] (0/4) Epoch 5, batch 5700, loss[loss=0.1993, simple_loss=0.2857, pruned_loss=0.05644, over 21706.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3131, pruned_loss=0.07631, over 4277538.81 frames. ], batch size: 298, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:45:20,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=766134.0, ans=0.125 2023-06-21 11:45:25,029 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.40 vs. limit=22.5 2023-06-21 11:46:33,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=766314.0, ans=0.0 2023-06-21 11:46:49,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=766314.0, ans=0.2 2023-06-21 11:47:09,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=766374.0, ans=0.2 2023-06-21 11:47:09,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=766374.0, ans=0.125 2023-06-21 11:47:17,585 INFO [train.py:996] (0/4) Epoch 5, batch 5750, loss[loss=0.1849, simple_loss=0.2687, pruned_loss=0.0505, over 21453.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3089, pruned_loss=0.07377, over 4273647.48 frames. ], batch size: 195, lr: 6.46e-03, grad_scale: 32.0 2023-06-21 11:47:31,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=766374.0, ans=0.0 2023-06-21 11:47:40,196 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.289e+02 2.668e+02 3.231e+02 5.394e+02, threshold=5.337e+02, percent-clipped=0.0 2023-06-21 11:47:54,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=766434.0, ans=0.0 2023-06-21 11:48:16,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-06-21 11:48:35,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-21 11:49:05,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=766554.0, ans=0.125 2023-06-21 11:49:57,047 INFO [train.py:996] (0/4) Epoch 5, batch 5800, loss[loss=0.2498, simple_loss=0.3197, pruned_loss=0.08998, over 21394.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3083, pruned_loss=0.07264, over 4280685.47 frames. 
], batch size: 548, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:50:14,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=766734.0, ans=0.125 2023-06-21 11:50:49,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=766794.0, ans=0.125 2023-06-21 11:52:11,023 INFO [train.py:996] (0/4) Epoch 5, batch 5850, loss[loss=0.1743, simple_loss=0.2697, pruned_loss=0.03943, over 21421.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.306, pruned_loss=0.06874, over 4282190.58 frames. ], batch size: 211, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:52:16,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-21 11:52:31,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 2.021e+02 2.544e+02 3.113e+02 4.412e+02, threshold=5.088e+02, percent-clipped=0.0 2023-06-21 11:53:36,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=767154.0, ans=0.125 2023-06-21 11:53:36,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=767154.0, ans=0.125 2023-06-21 11:54:15,383 INFO [train.py:996] (0/4) Epoch 5, batch 5900, loss[loss=0.2487, simple_loss=0.3225, pruned_loss=0.08745, over 21887.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2989, pruned_loss=0.06408, over 4272902.43 frames. ], batch size: 124, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:54:17,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=767274.0, ans=0.125 2023-06-21 11:55:37,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=767454.0, ans=0.1 2023-06-21 11:56:19,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=767514.0, ans=0.1 2023-06-21 11:56:31,137 INFO [train.py:996] (0/4) Epoch 5, batch 5950, loss[loss=0.217, simple_loss=0.2918, pruned_loss=0.07112, over 21834.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2984, pruned_loss=0.06649, over 4277323.43 frames. 
], batch size: 124, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:56:54,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 2.360e+02 2.756e+02 3.348e+02 5.051e+02, threshold=5.512e+02, percent-clipped=0.0 2023-06-21 11:57:38,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=767694.0, ans=0.125 2023-06-21 11:57:59,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=767754.0, ans=0.05 2023-06-21 11:58:20,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=767814.0, ans=0.0 2023-06-21 11:58:38,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=767814.0, ans=0.125 2023-06-21 11:58:40,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=767814.0, ans=0.5 2023-06-21 11:58:52,388 INFO [train.py:996] (0/4) Epoch 5, batch 6000, loss[loss=0.2153, simple_loss=0.2786, pruned_loss=0.07595, over 21888.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2946, pruned_loss=0.06931, over 4281571.56 frames. ], batch size: 373, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 11:58:52,390 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 11:59:55,600 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2623, simple_loss=0.3577, pruned_loss=0.08348, over 1796401.00 frames. 2023-06-21 11:59:55,601 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB 2023-06-21 12:00:21,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=767934.0, ans=0.2 2023-06-21 12:00:33,480 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-128000.pt 2023-06-21 12:01:56,882 INFO [train.py:996] (0/4) Epoch 5, batch 6050, loss[loss=0.1947, simple_loss=0.2588, pruned_loss=0.0653, over 21907.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2898, pruned_loss=0.07101, over 4287482.72 frames. ], batch size: 113, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 12:01:58,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=768174.0, ans=0.1 2023-06-21 12:02:27,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=768234.0, ans=0.125 2023-06-21 12:02:30,522 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 2.508e+02 2.739e+02 3.269e+02 4.730e+02, threshold=5.478e+02, percent-clipped=0.0 2023-06-21 12:02:32,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=768234.0, ans=0.125 2023-06-21 12:02:50,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=768294.0, ans=0.125 2023-06-21 12:03:25,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=768354.0, ans=0.0 2023-06-21 12:03:30,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.12 vs. 
limit=15.0 2023-06-21 12:04:00,824 INFO [train.py:996] (0/4) Epoch 5, batch 6100, loss[loss=0.2111, simple_loss=0.2703, pruned_loss=0.07597, over 21205.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2865, pruned_loss=0.0701, over 4276401.69 frames. ], batch size: 608, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 12:04:52,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=768534.0, ans=0.04949747468305833 2023-06-21 12:05:02,958 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:05:55,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=768714.0, ans=0.1 2023-06-21 12:06:07,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=768714.0, ans=0.0 2023-06-21 12:06:07,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=768714.0, ans=0.125 2023-06-21 12:06:19,078 INFO [train.py:996] (0/4) Epoch 5, batch 6150, loss[loss=0.2048, simple_loss=0.2755, pruned_loss=0.06698, over 21191.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2902, pruned_loss=0.07261, over 4287332.15 frames. ], batch size: 176, lr: 6.45e-03, grad_scale: 32.0 2023-06-21 12:06:34,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=768774.0, ans=0.0 2023-06-21 12:06:36,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=768774.0, ans=0.0 2023-06-21 12:06:40,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.345e+02 2.949e+02 3.402e+02 5.805e+02, threshold=5.898e+02, percent-clipped=1.0 2023-06-21 12:06:40,501 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:07:57,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.83 vs. limit=22.5 2023-06-21 12:08:23,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-21 12:08:32,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=769014.0, ans=0.025 2023-06-21 12:08:37,414 INFO [train.py:996] (0/4) Epoch 5, batch 6200, loss[loss=0.2553, simple_loss=0.3332, pruned_loss=0.08866, over 21837.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.296, pruned_loss=0.07333, over 4280054.35 frames. ], batch size: 351, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:09:01,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=769134.0, ans=0.0 2023-06-21 12:09:39,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. 
limit=6.0 2023-06-21 12:09:46,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=769194.0, ans=0.1 2023-06-21 12:09:49,784 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:10:36,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=1.97 vs. limit=12.0 2023-06-21 12:10:54,303 INFO [train.py:996] (0/4) Epoch 5, batch 6250, loss[loss=0.239, simple_loss=0.345, pruned_loss=0.06651, over 21737.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.301, pruned_loss=0.07284, over 4279116.82 frames. ], batch size: 298, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:10:59,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-21 12:11:01,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=769374.0, ans=22.5 2023-06-21 12:11:21,592 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 2.368e+02 2.704e+02 3.313e+02 4.790e+02, threshold=5.409e+02, percent-clipped=0.0 2023-06-21 12:11:41,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=769434.0, ans=0.2 2023-06-21 12:13:07,525 INFO [train.py:996] (0/4) Epoch 5, batch 6300, loss[loss=0.1949, simple_loss=0.2704, pruned_loss=0.05975, over 20020.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3039, pruned_loss=0.07172, over 4286552.08 frames. ], batch size: 703, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:13:12,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=769674.0, ans=0.125 2023-06-21 12:13:18,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=769674.0, ans=0.125 2023-06-21 12:15:21,312 INFO [train.py:996] (0/4) Epoch 5, batch 6350, loss[loss=0.2724, simple_loss=0.3401, pruned_loss=0.1023, over 21480.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3091, pruned_loss=0.0767, over 4292649.90 frames. ], batch size: 211, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:15:50,448 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.600e+02 2.925e+02 3.648e+02 4.818e+02, threshold=5.851e+02, percent-clipped=0.0 2023-06-21 12:16:39,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-21 12:17:48,484 INFO [train.py:996] (0/4) Epoch 5, batch 6400, loss[loss=0.2624, simple_loss=0.3341, pruned_loss=0.0953, over 21940.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3151, pruned_loss=0.08056, over 4283289.71 frames. ], batch size: 372, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:18:03,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=770274.0, ans=0.1 2023-06-21 12:18:14,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=770334.0, ans=0.125 2023-06-21 12:19:11,173 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. 
limit=15.0 2023-06-21 12:19:34,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-21 12:20:01,645 INFO [train.py:996] (0/4) Epoch 5, batch 6450, loss[loss=0.2073, simple_loss=0.305, pruned_loss=0.05475, over 21673.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3184, pruned_loss=0.08118, over 4284533.12 frames. ], batch size: 298, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:20:19,530 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.431e+02 2.805e+02 3.198e+02 5.945e+02, threshold=5.611e+02, percent-clipped=1.0 2023-06-21 12:21:11,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=770694.0, ans=0.035 2023-06-21 12:21:20,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=770754.0, ans=0.1 2023-06-21 12:21:58,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=770814.0, ans=0.2 2023-06-21 12:22:15,287 INFO [train.py:996] (0/4) Epoch 5, batch 6500, loss[loss=0.213, simple_loss=0.2756, pruned_loss=0.07519, over 21675.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3106, pruned_loss=0.07924, over 4274910.74 frames. ], batch size: 282, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:22:23,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=15.0 2023-06-21 12:22:55,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=770934.0, ans=0.125 2023-06-21 12:24:41,657 INFO [train.py:996] (0/4) Epoch 5, batch 6550, loss[loss=0.259, simple_loss=0.3323, pruned_loss=0.09288, over 21743.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3089, pruned_loss=0.07812, over 4272953.70 frames. ], batch size: 414, lr: 6.44e-03, grad_scale: 32.0 2023-06-21 12:24:58,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.666e+02 3.007e+02 3.674e+02 6.110e+02, threshold=6.015e+02, percent-clipped=2.0 2023-06-21 12:25:00,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=771234.0, ans=0.1 2023-06-21 12:25:51,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=771294.0, ans=0.125 2023-06-21 12:26:16,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=771354.0, ans=0.125 2023-06-21 12:26:22,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=771414.0, ans=0.0 2023-06-21 12:26:52,305 INFO [train.py:996] (0/4) Epoch 5, batch 6600, loss[loss=0.2161, simple_loss=0.2803, pruned_loss=0.07595, over 21867.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3029, pruned_loss=0.07739, over 4265644.98 frames. ], batch size: 373, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:27:02,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=771474.0, ans=0.0 2023-06-21 12:27:47,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. 
limit=15.0 2023-06-21 12:28:37,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=12.0 2023-06-21 12:28:56,945 INFO [train.py:996] (0/4) Epoch 5, batch 6650, loss[loss=0.1933, simple_loss=0.2593, pruned_loss=0.06366, over 21543.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2969, pruned_loss=0.07575, over 4268091.62 frames. ], batch size: 132, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:29:40,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.317e+02 2.595e+02 2.999e+02 4.464e+02, threshold=5.189e+02, percent-clipped=0.0 2023-06-21 12:30:24,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-21 12:31:08,799 INFO [train.py:996] (0/4) Epoch 5, batch 6700, loss[loss=0.2125, simple_loss=0.2712, pruned_loss=0.07696, over 21289.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2915, pruned_loss=0.07508, over 4272460.01 frames. ], batch size: 159, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:32:22,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=772254.0, ans=0.125 2023-06-21 12:33:00,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=772314.0, ans=0.1 2023-06-21 12:33:12,800 INFO [train.py:996] (0/4) Epoch 5, batch 6750, loss[loss=0.2086, simple_loss=0.2782, pruned_loss=0.0695, over 21287.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2892, pruned_loss=0.07537, over 4262856.88 frames. ], batch size: 194, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:33:13,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-21 12:33:22,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-06-21 12:33:58,963 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.401e+02 2.813e+02 3.332e+02 5.748e+02, threshold=5.626e+02, percent-clipped=2.0 2023-06-21 12:34:16,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=772494.0, ans=0.125 2023-06-21 12:35:25,716 INFO [train.py:996] (0/4) Epoch 5, batch 6800, loss[loss=0.1983, simple_loss=0.2685, pruned_loss=0.064, over 21599.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2905, pruned_loss=0.07703, over 4261939.48 frames. 
], batch size: 263, lr: 6.43e-03, grad_scale: 32.0 2023-06-21 12:36:12,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=772734.0, ans=0.1 2023-06-21 12:36:13,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=772734.0, ans=0.1 2023-06-21 12:36:28,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=772794.0, ans=0.1 2023-06-21 12:36:38,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=772854.0, ans=0.125 2023-06-21 12:36:51,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=772854.0, ans=0.0 2023-06-21 12:37:12,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=772914.0, ans=0.2 2023-06-21 12:37:19,282 INFO [train.py:996] (0/4) Epoch 5, batch 6850, loss[loss=0.2573, simple_loss=0.3247, pruned_loss=0.09494, over 21852.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2894, pruned_loss=0.07739, over 4264221.99 frames. ], batch size: 118, lr: 6.43e-03, grad_scale: 32.0 2023-06-21 12:38:02,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.533e+02 2.911e+02 3.331e+02 6.135e+02, threshold=5.822e+02, percent-clipped=1.0 2023-06-21 12:38:29,478 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=15.0 2023-06-21 12:39:03,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=773154.0, ans=0.1 2023-06-21 12:39:44,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=773214.0, ans=0.125 2023-06-21 12:39:46,838 INFO [train.py:996] (0/4) Epoch 5, batch 6900, loss[loss=0.2488, simple_loss=0.3414, pruned_loss=0.07808, over 21608.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2918, pruned_loss=0.07769, over 4264568.65 frames. ], batch size: 471, lr: 6.43e-03, grad_scale: 32.0 2023-06-21 12:41:13,311 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=22.5 2023-06-21 12:41:35,815 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:42:06,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=773574.0, ans=0.2 2023-06-21 12:42:07,211 INFO [train.py:996] (0/4) Epoch 5, batch 6950, loss[loss=0.2426, simple_loss=0.3167, pruned_loss=0.0843, over 21659.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2939, pruned_loss=0.07521, over 4264373.56 frames. 
], batch size: 351, lr: 6.43e-03, grad_scale: 16.0 2023-06-21 12:42:18,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=773574.0, ans=0.125 2023-06-21 12:42:18,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=773574.0, ans=0.125 2023-06-21 12:42:45,186 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.439e+02 2.793e+02 3.180e+02 5.285e+02, threshold=5.586e+02, percent-clipped=0.0 2023-06-21 12:42:45,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=773634.0, ans=0.125 2023-06-21 12:43:54,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=773814.0, ans=0.125 2023-06-21 12:44:15,867 INFO [train.py:996] (0/4) Epoch 5, batch 7000, loss[loss=0.2191, simple_loss=0.2891, pruned_loss=0.07456, over 21344.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.297, pruned_loss=0.07724, over 4269465.20 frames. ], batch size: 131, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:44:22,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=773874.0, ans=0.1 2023-06-21 12:45:11,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=773934.0, ans=0.0 2023-06-21 12:45:17,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=773994.0, ans=0.125 2023-06-21 12:45:47,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-06-21 12:46:29,992 INFO [train.py:996] (0/4) Epoch 5, batch 7050, loss[loss=0.2232, simple_loss=0.3328, pruned_loss=0.05681, over 19753.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2951, pruned_loss=0.07555, over 4259316.42 frames. ], batch size: 703, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:46:35,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=774174.0, ans=0.0 2023-06-21 12:47:07,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.342e+02 2.952e+02 3.730e+02 6.285e+02, threshold=5.903e+02, percent-clipped=1.0 2023-06-21 12:47:36,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=774294.0, ans=0.2 2023-06-21 12:48:00,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=774294.0, ans=0.0 2023-06-21 12:48:32,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=774414.0, ans=0.2 2023-06-21 12:48:50,491 INFO [train.py:996] (0/4) Epoch 5, batch 7100, loss[loss=0.1858, simple_loss=0.2657, pruned_loss=0.053, over 21671.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2999, pruned_loss=0.07685, over 4262363.84 frames. 
], batch size: 247, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:48:50,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=774474.0, ans=0.125 2023-06-21 12:49:13,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=774474.0, ans=0.125 2023-06-21 12:49:13,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=774474.0, ans=0.0 2023-06-21 12:49:37,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=774534.0, ans=0.2 2023-06-21 12:50:11,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=774654.0, ans=0.05 2023-06-21 12:50:55,950 INFO [train.py:996] (0/4) Epoch 5, batch 7150, loss[loss=0.2443, simple_loss=0.3125, pruned_loss=0.08805, over 21290.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2948, pruned_loss=0.07373, over 4258544.25 frames. ], batch size: 176, lr: 6.42e-03, grad_scale: 16.0 2023-06-21 12:51:08,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=774774.0, ans=0.125 2023-06-21 12:51:23,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-21 12:51:34,350 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.394e+02 2.698e+02 3.391e+02 6.183e+02, threshold=5.396e+02, percent-clipped=2.0 2023-06-21 12:52:03,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=774894.0, ans=15.0 2023-06-21 12:53:11,025 INFO [train.py:996] (0/4) Epoch 5, batch 7200, loss[loss=0.2222, simple_loss=0.2811, pruned_loss=0.0817, over 21161.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2982, pruned_loss=0.07637, over 4264227.73 frames. ], batch size: 176, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 12:53:25,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=775074.0, ans=0.125 2023-06-21 12:53:37,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=775134.0, ans=0.0 2023-06-21 12:53:39,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=775134.0, ans=0.2 2023-06-21 12:54:42,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=775254.0, ans=0.125 2023-06-21 12:55:20,764 INFO [train.py:996] (0/4) Epoch 5, batch 7250, loss[loss=0.2181, simple_loss=0.2687, pruned_loss=0.08375, over 21295.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2945, pruned_loss=0.07639, over 4270038.22 frames. ], batch size: 177, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 12:55:42,658 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.77 vs. 
limit=15.0 2023-06-21 12:55:56,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=775434.0, ans=0.07 2023-06-21 12:56:02,053 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.505e+02 2.857e+02 3.370e+02 5.311e+02, threshold=5.714e+02, percent-clipped=0.0 2023-06-21 12:56:12,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=775434.0, ans=0.1 2023-06-21 12:56:20,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=775494.0, ans=0.1 2023-06-21 12:56:20,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=775494.0, ans=0.125 2023-06-21 12:56:44,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=775554.0, ans=0.0 2023-06-21 12:57:12,910 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:57:17,790 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-21 12:57:38,535 INFO [train.py:996] (0/4) Epoch 5, batch 7300, loss[loss=0.2004, simple_loss=0.2582, pruned_loss=0.07132, over 21383.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.29, pruned_loss=0.07574, over 4269948.62 frames. ], batch size: 160, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 12:58:19,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=775734.0, ans=0.09899494936611666 2023-06-21 12:59:41,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=775974.0, ans=0.07 2023-06-21 12:59:42,998 INFO [train.py:996] (0/4) Epoch 5, batch 7350, loss[loss=0.2707, simple_loss=0.3325, pruned_loss=0.1045, over 21342.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2869, pruned_loss=0.07631, over 4269385.49 frames. ], batch size: 159, lr: 6.42e-03, grad_scale: 32.0 2023-06-21 13:00:25,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.511e+02 2.899e+02 3.601e+02 8.387e+02, threshold=5.798e+02, percent-clipped=4.0 2023-06-21 13:00:27,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=776034.0, ans=0.0 2023-06-21 13:00:30,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=776034.0, ans=0.0 2023-06-21 13:01:59,271 INFO [train.py:996] (0/4) Epoch 5, batch 7400, loss[loss=0.2133, simple_loss=0.2813, pruned_loss=0.0727, over 21354.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2921, pruned_loss=0.0789, over 4274829.94 frames. ], batch size: 176, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:02:13,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=776274.0, ans=0.125 2023-06-21 13:03:11,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. 
limit=15.0 2023-06-21 13:03:21,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-21 13:03:56,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=776514.0, ans=0.125 2023-06-21 13:04:05,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-21 13:04:10,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-21 13:04:15,305 INFO [train.py:996] (0/4) Epoch 5, batch 7450, loss[loss=0.2116, simple_loss=0.2765, pruned_loss=0.07336, over 21106.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2912, pruned_loss=0.07872, over 4275056.18 frames. ], batch size: 176, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:04:26,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=776574.0, ans=0.125 2023-06-21 13:04:34,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=776574.0, ans=0.0 2023-06-21 13:04:40,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=776634.0, ans=0.0 2023-06-21 13:04:40,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.565e+02 3.143e+02 4.189e+02 6.806e+02, threshold=6.287e+02, percent-clipped=2.0 2023-06-21 13:04:50,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=776694.0, ans=0.125 2023-06-21 13:05:00,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.99 vs. limit=15.0 2023-06-21 13:05:01,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=776694.0, ans=0.2 2023-06-21 13:05:53,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=776814.0, ans=0.0 2023-06-21 13:06:30,092 INFO [train.py:996] (0/4) Epoch 5, batch 7500, loss[loss=0.2845, simple_loss=0.3512, pruned_loss=0.1089, over 21515.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2973, pruned_loss=0.08011, over 4274397.45 frames. ], batch size: 389, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:06:47,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=776934.0, ans=0.0 2023-06-21 13:07:11,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=776934.0, ans=0.125 2023-06-21 13:07:19,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=776994.0, ans=0.125 2023-06-21 13:07:27,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=776994.0, ans=0.0 2023-06-21 13:08:49,589 INFO [train.py:996] (0/4) Epoch 5, batch 7550, loss[loss=0.1977, simple_loss=0.2933, pruned_loss=0.05105, over 21579.00 frames. 
], tot_loss[loss=0.2342, simple_loss=0.3079, pruned_loss=0.08031, over 4280520.71 frames. ], batch size: 230, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:08:59,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=777174.0, ans=22.5 2023-06-21 13:09:33,362 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.666e+02 3.013e+02 3.483e+02 5.379e+02, threshold=6.026e+02, percent-clipped=0.0 2023-06-21 13:10:09,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=777354.0, ans=0.0 2023-06-21 13:10:24,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=777354.0, ans=0.95 2023-06-21 13:10:25,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777354.0, ans=0.1 2023-06-21 13:10:56,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=777414.0, ans=0.125 2023-06-21 13:10:58,793 INFO [train.py:996] (0/4) Epoch 5, batch 7600, loss[loss=0.2186, simple_loss=0.289, pruned_loss=0.07407, over 21901.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3083, pruned_loss=0.07987, over 4291659.32 frames. ], batch size: 316, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:10:59,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=777474.0, ans=0.125 2023-06-21 13:11:16,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-21 13:11:37,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=777534.0, ans=0.125 2023-06-21 13:11:48,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=777534.0, ans=0.125 2023-06-21 13:12:12,702 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-21 13:12:54,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=777654.0, ans=0.0 2023-06-21 13:12:58,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=777714.0, ans=0.0 2023-06-21 13:13:18,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.99 vs. limit=10.0 2023-06-21 13:13:19,293 INFO [train.py:996] (0/4) Epoch 5, batch 7650, loss[loss=0.2466, simple_loss=0.3006, pruned_loss=0.09632, over 21569.00 frames. ], tot_loss[loss=0.234, simple_loss=0.306, pruned_loss=0.08094, over 4290815.95 frames. ], batch size: 195, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:13:28,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=777774.0, ans=0.2 2023-06-21 13:13:33,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. 
limit=15.0 2023-06-21 13:14:08,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.535e+02 2.921e+02 3.505e+02 6.410e+02, threshold=5.842e+02, percent-clipped=1.0 2023-06-21 13:15:14,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=777954.0, ans=0.125 2023-06-21 13:15:30,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.97 vs. limit=10.0 2023-06-21 13:15:45,169 INFO [train.py:996] (0/4) Epoch 5, batch 7700, loss[loss=0.3475, simple_loss=0.3832, pruned_loss=0.1559, over 21350.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3088, pruned_loss=0.08415, over 4295857.15 frames. ], batch size: 507, lr: 6.41e-03, grad_scale: 32.0 2023-06-21 13:16:05,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=778134.0, ans=0.125 2023-06-21 13:16:45,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.39 vs. limit=15.0 2023-06-21 13:17:13,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=778254.0, ans=0.0 2023-06-21 13:17:52,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=778314.0, ans=0.0 2023-06-21 13:17:55,139 INFO [train.py:996] (0/4) Epoch 5, batch 7750, loss[loss=0.2943, simple_loss=0.3944, pruned_loss=0.0971, over 21645.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3129, pruned_loss=0.08252, over 4293632.92 frames. ], batch size: 414, lr: 6.41e-03, grad_scale: 16.0 2023-06-21 13:18:43,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.478e+02 2.745e+02 3.092e+02 4.488e+02, threshold=5.489e+02, percent-clipped=0.0 2023-06-21 13:18:56,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=778434.0, ans=0.125 2023-06-21 13:18:57,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=778494.0, ans=0.125 2023-06-21 13:19:35,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=778554.0, ans=0.125 2023-06-21 13:19:42,713 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.87 vs. limit=10.0 2023-06-21 13:19:53,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=778614.0, ans=0.125 2023-06-21 13:20:24,283 INFO [train.py:996] (0/4) Epoch 5, batch 7800, loss[loss=0.2279, simple_loss=0.3004, pruned_loss=0.07771, over 21769.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3148, pruned_loss=0.08317, over 4282646.72 frames. 
], batch size: 333, lr: 6.40e-03, grad_scale: 16.0 2023-06-21 13:20:49,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=778734.0, ans=0.0 2023-06-21 13:21:06,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=778734.0, ans=0.1 2023-06-21 13:21:12,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=778794.0, ans=0.0 2023-06-21 13:21:36,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=778794.0, ans=0.1 2023-06-21 13:21:44,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=778854.0, ans=0.125 2023-06-21 13:21:44,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=778854.0, ans=0.04949747468305833 2023-06-21 13:21:59,828 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:22:29,714 INFO [train.py:996] (0/4) Epoch 5, batch 7850, loss[loss=0.2099, simple_loss=0.2704, pruned_loss=0.0747, over 21563.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3066, pruned_loss=0.08155, over 4281983.51 frames. ], batch size: 263, lr: 6.40e-03, grad_scale: 16.0 2023-06-21 13:23:06,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.498e+02 2.821e+02 3.538e+02 6.560e+02, threshold=5.643e+02, percent-clipped=3.0 2023-06-21 13:24:04,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=779154.0, ans=0.125 2023-06-21 13:24:33,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=779214.0, ans=0.125 2023-06-21 13:24:41,376 INFO [train.py:996] (0/4) Epoch 5, batch 7900, loss[loss=0.2691, simple_loss=0.3643, pruned_loss=0.08695, over 21622.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3024, pruned_loss=0.08039, over 4272361.10 frames. ], batch size: 414, lr: 6.40e-03, grad_scale: 16.0 2023-06-21 13:25:15,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=779274.0, ans=0.0 2023-06-21 13:26:00,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=779394.0, ans=0.125 2023-06-21 13:26:00,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779394.0, ans=0.1 2023-06-21 13:26:06,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779454.0, ans=0.1 2023-06-21 13:26:33,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=779514.0, ans=0.125 2023-06-21 13:27:14,824 INFO [train.py:996] (0/4) Epoch 5, batch 7950, loss[loss=0.2449, simple_loss=0.3185, pruned_loss=0.08561, over 21243.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3083, pruned_loss=0.08005, over 4272734.53 frames. 
], batch size: 143, lr: 6.40e-03, grad_scale: 16.0 2023-06-21 13:27:18,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=779574.0, ans=0.0 2023-06-21 13:27:41,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=779634.0, ans=0.125 2023-06-21 13:27:42,670 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.594e+02 2.803e+02 3.768e+02 5.185e+02, threshold=5.606e+02, percent-clipped=0.0 2023-06-21 13:28:04,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-21 13:28:06,751 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:28:16,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=779754.0, ans=0.125 2023-06-21 13:28:16,517 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=22.5 2023-06-21 13:28:36,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-21 13:29:26,306 INFO [train.py:996] (0/4) Epoch 5, batch 8000, loss[loss=0.2343, simple_loss=0.3054, pruned_loss=0.08166, over 21246.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3114, pruned_loss=0.08283, over 4267634.19 frames. ], batch size: 176, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:29:31,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=779874.0, ans=0.05 2023-06-21 13:30:38,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-21 13:30:39,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=779994.0, ans=0.125 2023-06-21 13:32:03,754 INFO [train.py:996] (0/4) Epoch 5, batch 8050, loss[loss=0.2714, simple_loss=0.3597, pruned_loss=0.09157, over 21666.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3142, pruned_loss=0.08333, over 4270852.73 frames. ], batch size: 389, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:32:18,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=780234.0, ans=0.0 2023-06-21 13:32:24,840 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.993e+02 3.544e+02 4.620e+02 9.797e+02, threshold=7.088e+02, percent-clipped=13.0 2023-06-21 13:32:49,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.09 vs. limit=10.0 2023-06-21 13:33:23,317 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.41 vs. 
limit=6.0 2023-06-21 13:33:48,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=780414.0, ans=0.04949747468305833 2023-06-21 13:34:12,510 INFO [train.py:996] (0/4) Epoch 5, batch 8100, loss[loss=0.2713, simple_loss=0.3225, pruned_loss=0.1101, over 21635.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3128, pruned_loss=0.0833, over 4273024.79 frames. ], batch size: 471, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:34:13,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=780474.0, ans=0.1 2023-06-21 13:34:14,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-21 13:35:04,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=780534.0, ans=0.1 2023-06-21 13:35:09,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=780534.0, ans=0.0 2023-06-21 13:35:49,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.66 vs. limit=22.5 2023-06-21 13:35:54,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=780654.0, ans=0.0 2023-06-21 13:36:21,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=780714.0, ans=0.04949747468305833 2023-06-21 13:36:57,694 INFO [train.py:996] (0/4) Epoch 5, batch 8150, loss[loss=0.2647, simple_loss=0.38, pruned_loss=0.0747, over 21175.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3188, pruned_loss=0.08389, over 4269016.14 frames. ], batch size: 548, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:37:42,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.573e+02 2.921e+02 3.508e+02 5.879e+02, threshold=5.842e+02, percent-clipped=0.0 2023-06-21 13:38:35,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=780954.0, ans=0.025 2023-06-21 13:39:08,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-21 13:39:08,747 INFO [train.py:996] (0/4) Epoch 5, batch 8200, loss[loss=0.223, simple_loss=0.2862, pruned_loss=0.07987, over 21549.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3125, pruned_loss=0.08117, over 4261599.83 frames. ], batch size: 391, lr: 6.40e-03, grad_scale: 32.0 2023-06-21 13:39:40,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-21 13:39:51,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=781134.0, ans=0.2 2023-06-21 13:40:22,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=22.5 2023-06-21 13:40:57,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.47 vs. 
limit=15.0 2023-06-21 13:41:33,483 INFO [train.py:996] (0/4) Epoch 5, batch 8250, loss[loss=0.2528, simple_loss=0.3436, pruned_loss=0.08097, over 21809.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3105, pruned_loss=0.08096, over 4252754.78 frames. ], batch size: 371, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:41:44,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=781374.0, ans=0.04949747468305833 2023-06-21 13:42:06,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.500e+02 2.988e+02 3.537e+02 7.334e+02, threshold=5.975e+02, percent-clipped=1.0 2023-06-21 13:42:19,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=781434.0, ans=0.125 2023-06-21 13:42:48,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.14 vs. limit=10.0 2023-06-21 13:43:21,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=781554.0, ans=0.0 2023-06-21 13:43:23,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=781554.0, ans=0.125 2023-06-21 13:43:46,531 INFO [train.py:996] (0/4) Epoch 5, batch 8300, loss[loss=0.1771, simple_loss=0.2558, pruned_loss=0.04918, over 21385.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3072, pruned_loss=0.07862, over 4247276.29 frames. ], batch size: 194, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:44:00,961 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-21 13:45:44,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=781914.0, ans=0.1 2023-06-21 13:46:00,686 INFO [train.py:996] (0/4) Epoch 5, batch 8350, loss[loss=0.2374, simple_loss=0.3117, pruned_loss=0.08155, over 21588.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3067, pruned_loss=0.07706, over 4256856.83 frames. ], batch size: 391, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:46:11,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=781974.0, ans=0.125 2023-06-21 13:46:41,462 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-21 13:46:44,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=782034.0, ans=0.1 2023-06-21 13:46:45,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.563e+02 2.962e+02 3.725e+02 6.327e+02, threshold=5.925e+02, percent-clipped=2.0 2023-06-21 13:47:14,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-21 13:47:25,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=782154.0, ans=0.04949747468305833 2023-06-21 13:47:38,373 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.16 vs. 
limit=15.0 2023-06-21 13:47:44,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=22.5 2023-06-21 13:48:19,760 INFO [train.py:996] (0/4) Epoch 5, batch 8400, loss[loss=0.2168, simple_loss=0.2942, pruned_loss=0.06971, over 21781.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3049, pruned_loss=0.07473, over 4261685.40 frames. ], batch size: 317, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:48:45,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=782334.0, ans=0.1 2023-06-21 13:49:37,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=782454.0, ans=0.2 2023-06-21 13:50:17,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=782514.0, ans=0.125 2023-06-21 13:50:33,100 INFO [train.py:996] (0/4) Epoch 5, batch 8450, loss[loss=0.2253, simple_loss=0.2786, pruned_loss=0.08604, over 20763.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3022, pruned_loss=0.07418, over 4261644.56 frames. ], batch size: 609, lr: 6.39e-03, grad_scale: 32.0 2023-06-21 13:50:53,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=782634.0, ans=0.1 2023-06-21 13:51:01,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.311e+02 2.681e+02 3.207e+02 6.839e+02, threshold=5.362e+02, percent-clipped=1.0 2023-06-21 13:51:03,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=782634.0, ans=0.125 2023-06-21 13:51:05,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=782634.0, ans=0.125 2023-06-21 13:51:16,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=22.5 2023-06-21 13:51:35,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-21 13:52:30,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=782874.0, ans=0.07 2023-06-21 13:52:31,598 INFO [train.py:996] (0/4) Epoch 5, batch 8500, loss[loss=0.1947, simple_loss=0.2573, pruned_loss=0.06605, over 21473.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2984, pruned_loss=0.07528, over 4262623.79 frames. ], batch size: 212, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:53:06,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=782994.0, ans=0.125 2023-06-21 13:53:51,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=783054.0, ans=0.2 2023-06-21 13:54:35,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=783114.0, ans=0.125 2023-06-21 13:54:41,554 INFO [train.py:996] (0/4) Epoch 5, batch 8550, loss[loss=0.2596, simple_loss=0.3398, pruned_loss=0.08964, over 21734.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3031, pruned_loss=0.07842, over 4272386.48 frames. 
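Aside: the recurring [optim.py:471] lines report five grad-norm statistics (reading as min, 25%, median, 75%, max) plus a clipping threshold, and in every entry above the threshold equals 2.0 times the logged median (e.g. 2.0 x 2.681e+02 = 5.362e+02 in the 13:51:01 line), matching Clipping_scale=2.0. A minimal sketch of such bookkeeping; the window size and the reset policy for percent-clipped are assumptions, not taken from optim.py:

    import torch
    from collections import deque

    class GradNormStats:
        # Hypothetical bookkeeping for the "Clipping_scale=..., grad-norm
        # quartiles ... threshold=... percent-clipped=..." log lines.
        def __init__(self, clipping_scale=2.0, window=400):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)       # recent global grad norms
            self.clipped_since_report = 0
            self.steps_since_report = 0

        def step(self, parameters):
            per_param = [p.grad.detach().norm() for p in parameters
                         if p.grad is not None]
            norm = torch.stack(per_param).norm().item()  # global L2 grad norm
            self.norms.append(norm)
            median = sorted(self.norms)[len(self.norms) // 2]
            threshold = self.clipping_scale * median     # 2.0 x median, as logged
            self.steps_since_report += 1
            self.clipped_since_report += norm > threshold
            return norm, threshold

        def report(self):
            # Called every logging interval; returns (min, q1, median, q3, max)
            # and the share of batches clipped since the last report.
            q = sorted(self.norms)
            quartiles = [q[int(f * (len(q) - 1))] for f in (0, 0.25, 0.5, 0.75, 1)]
            pct = 100.0 * self.clipped_since_report / max(1, self.steps_since_report)
            self.clipped_since_report = self.steps_since_report = 0
            return quartiles, pct
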
], batch size: 351, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:55:07,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=783234.0, ans=0.125 2023-06-21 13:55:31,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.687e+02 3.119e+02 3.831e+02 5.921e+02, threshold=6.237e+02, percent-clipped=6.0 2023-06-21 13:55:36,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=783294.0, ans=0.125 2023-06-21 13:55:38,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=783294.0, ans=0.125 2023-06-21 13:55:47,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.34 vs. limit=15.0 2023-06-21 13:56:38,382 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:57:03,686 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.93 vs. limit=5.0 2023-06-21 13:57:03,916 INFO [train.py:996] (0/4) Epoch 5, batch 8600, loss[loss=0.2591, simple_loss=0.3357, pruned_loss=0.09128, over 21639.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3101, pruned_loss=0.08025, over 4273248.76 frames. ], batch size: 389, lr: 6.39e-03, grad_scale: 16.0 2023-06-21 13:57:05,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=783474.0, ans=0.125 2023-06-21 13:57:15,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=783474.0, ans=12.0 2023-06-21 13:59:02,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-21 13:59:20,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=783714.0, ans=10.0 2023-06-21 13:59:28,622 INFO [train.py:996] (0/4) Epoch 5, batch 8650, loss[loss=0.1841, simple_loss=0.2807, pruned_loss=0.04371, over 21758.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3158, pruned_loss=0.08032, over 4271982.99 frames. ], batch size: 332, lr: 6.38e-03, grad_scale: 16.0 2023-06-21 13:59:29,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783774.0, ans=0.1 2023-06-21 13:59:35,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=783774.0, ans=0.1 2023-06-21 13:59:54,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-21 14:00:03,580 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.535e+02 2.923e+02 3.243e+02 4.542e+02, threshold=5.846e+02, percent-clipped=0.0 2023-06-21 14:01:20,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=784014.0, ans=0.125 2023-06-21 14:01:20,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=784014.0, ans=0.125 2023-06-21 14:01:27,562 INFO [train.py:996] (0/4) Epoch 5, batch 8700, loss[loss=0.19, simple_loss=0.2569, pruned_loss=0.06157, over 21421.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.308, pruned_loss=0.07836, over 4260129.93 frames. ], batch size: 131, lr: 6.38e-03, grad_scale: 16.0 2023-06-21 14:01:51,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=784134.0, ans=0.2 2023-06-21 14:03:12,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=784314.0, ans=15.0 2023-06-21 14:03:36,315 INFO [train.py:996] (0/4) Epoch 5, batch 8750, loss[loss=0.2208, simple_loss=0.2836, pruned_loss=0.07897, over 21575.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3047, pruned_loss=0.0791, over 4266532.13 frames. ], batch size: 195, lr: 6.38e-03, grad_scale: 16.0 2023-06-21 14:03:41,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=784374.0, ans=0.2 2023-06-21 14:03:46,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=15.0 2023-06-21 14:03:58,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=784434.0, ans=0.0 2023-06-21 14:04:08,566 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.564e+02 3.026e+02 3.702e+02 5.969e+02, threshold=6.051e+02, percent-clipped=2.0 2023-06-21 14:04:43,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=784494.0, ans=0.0 2023-06-21 14:05:26,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=784554.0, ans=0.035 2023-06-21 14:06:02,650 INFO [train.py:996] (0/4) Epoch 5, batch 8800, loss[loss=0.2572, simple_loss=0.3119, pruned_loss=0.1012, over 20242.00 frames. ], tot_loss[loss=0.238, simple_loss=0.312, pruned_loss=0.08204, over 4267357.21 frames. ], batch size: 707, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:06:03,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=784674.0, ans=0.125 2023-06-21 14:06:27,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=784674.0, ans=0.0 2023-06-21 14:06:56,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-21 14:06:57,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=784734.0, ans=0.0 2023-06-21 14:07:00,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.63 vs. 
limit=12.0 2023-06-21 14:07:58,848 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.42 vs. limit=22.5 2023-06-21 14:08:15,089 INFO [train.py:996] (0/4) Epoch 5, batch 8850, loss[loss=0.2603, simple_loss=0.3264, pruned_loss=0.09708, over 21336.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3219, pruned_loss=0.08454, over 4258659.52 frames. ], batch size: 471, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:08:31,352 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:08:53,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=785034.0, ans=0.2 2023-06-21 14:08:56,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.643e+02 2.984e+02 3.583e+02 6.158e+02, threshold=5.968e+02, percent-clipped=1.0 2023-06-21 14:10:12,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0 2023-06-21 14:10:42,214 INFO [train.py:996] (0/4) Epoch 5, batch 8900, loss[loss=0.2326, simple_loss=0.3049, pruned_loss=0.08008, over 21226.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3175, pruned_loss=0.0836, over 4260624.81 frames. ], batch size: 548, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:10:44,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=785274.0, ans=0.1 2023-06-21 14:12:50,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=785514.0, ans=0.125 2023-06-21 14:12:52,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2023-06-21 14:13:06,510 INFO [train.py:996] (0/4) Epoch 5, batch 8950, loss[loss=0.2302, simple_loss=0.2961, pruned_loss=0.08215, over 21664.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.319, pruned_loss=0.0839, over 4264358.97 frames. ], batch size: 298, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:13:09,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=785574.0, ans=0.0 2023-06-21 14:13:48,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=785634.0, ans=0.1 2023-06-21 14:13:49,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=785634.0, ans=0.125 2023-06-21 14:13:51,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=785634.0, ans=0.5 2023-06-21 14:13:51,924 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.45 vs. 
limit=15.0 2023-06-21 14:13:52,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.821e+02 3.330e+02 4.267e+02 7.491e+02, threshold=6.660e+02, percent-clipped=6.0 2023-06-21 14:14:09,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=785694.0, ans=0.0 2023-06-21 14:14:18,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=785754.0, ans=0.0 2023-06-21 14:15:10,355 INFO [train.py:996] (0/4) Epoch 5, batch 9000, loss[loss=0.2518, simple_loss=0.3005, pruned_loss=0.1015, over 21240.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3127, pruned_loss=0.08283, over 4260236.78 frames. ], batch size: 471, lr: 6.38e-03, grad_scale: 32.0 2023-06-21 14:15:10,357 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 14:16:12,919 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2688, simple_loss=0.3596, pruned_loss=0.08904, over 1796401.00 frames. 2023-06-21 14:16:12,921 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB 2023-06-21 14:16:30,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=785874.0, ans=0.125 2023-06-21 14:16:56,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=785994.0, ans=0.0 2023-06-21 14:17:49,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=786114.0, ans=0.1 2023-06-21 14:18:19,062 INFO [train.py:996] (0/4) Epoch 5, batch 9050, loss[loss=0.2374, simple_loss=0.3136, pruned_loss=0.0806, over 21889.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3078, pruned_loss=0.07957, over 4262863.64 frames. ], batch size: 372, lr: 6.37e-03, grad_scale: 32.0 2023-06-21 14:18:31,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=786174.0, ans=0.125 2023-06-21 14:18:33,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-06-21 14:18:53,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=786234.0, ans=0.5 2023-06-21 14:19:10,718 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.633e+02 2.951e+02 3.520e+02 5.679e+02, threshold=5.902e+02, percent-clipped=0.0 2023-06-21 14:19:40,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.08 vs. limit=15.0 2023-06-21 14:19:46,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=786354.0, ans=0.125 2023-06-21 14:20:38,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=786414.0, ans=0.125 2023-06-21 14:20:50,345 INFO [train.py:996] (0/4) Epoch 5, batch 9100, loss[loss=0.2555, simple_loss=0.3472, pruned_loss=0.08187, over 21597.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3146, pruned_loss=0.08187, over 4263580.14 frames. 
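Aside: the "[train.py:1019] Computing validation loss" / "[train.py:1028] ... validation: loss=... over 1796401.00 frames" pair above corresponds to a periodic pass over the dev dataloader, and the "Maximum memory allocated" line can be read directly from torch.cuda.max_memory_allocated. A minimal sketch under assumed interfaces (compute_loss and its return values are placeholders, not icefall's actual signatures):

    import torch

    def run_validation(model, valid_dl, compute_loss, device):
        # Frame-weighted average loss over the dev set; mirrors the
        # "Epoch N, validation: loss=... over M frames" report.
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_dl:
                loss, num_frames = compute_loss(model, batch)  # placeholder API
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        max_mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        return tot_loss / tot_frames, max_mem_mb
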
], batch size: 389, lr: 6.37e-03, grad_scale: 32.0 2023-06-21 14:20:52,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=786474.0, ans=0.125 2023-06-21 14:21:45,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=786594.0, ans=0.0 2023-06-21 14:22:19,978 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-21 14:22:31,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=786654.0, ans=0.0 2023-06-21 14:23:07,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=786714.0, ans=0.125 2023-06-21 14:23:17,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-21 14:23:17,968 INFO [train.py:996] (0/4) Epoch 5, batch 9150, loss[loss=0.2794, simple_loss=0.3753, pruned_loss=0.09178, over 21305.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3153, pruned_loss=0.07942, over 4259642.55 frames. ], batch size: 548, lr: 6.37e-03, grad_scale: 32.0 2023-06-21 14:23:34,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-21 14:23:53,853 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.515e+02 3.021e+02 3.547e+02 6.572e+02, threshold=6.043e+02, percent-clipped=1.0 2023-06-21 14:24:29,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.53 vs. limit=15.0 2023-06-21 14:24:45,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-06-21 14:25:26,982 INFO [train.py:996] (0/4) Epoch 5, batch 9200, loss[loss=0.2466, simple_loss=0.3342, pruned_loss=0.0795, over 21660.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.318, pruned_loss=0.07873, over 4261576.47 frames. ], batch size: 441, lr: 6.37e-03, grad_scale: 32.0 2023-06-21 14:25:31,137 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.57 vs. limit=22.5 2023-06-21 14:26:05,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=787134.0, ans=0.125 2023-06-21 14:27:45,227 INFO [train.py:996] (0/4) Epoch 5, batch 9250, loss[loss=0.2432, simple_loss=0.3102, pruned_loss=0.08811, over 21776.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3207, pruned_loss=0.08141, over 4266850.73 frames. ], batch size: 124, lr: 6.37e-03, grad_scale: 32.0 2023-06-21 14:27:52,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=787374.0, ans=0.1 2023-06-21 14:28:01,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. 
limit=8.0 2023-06-21 14:28:28,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 2.551e+02 3.044e+02 3.631e+02 5.502e+02, threshold=6.089e+02, percent-clipped=0.0 2023-06-21 14:28:38,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-21 14:30:00,545 INFO [train.py:996] (0/4) Epoch 5, batch 9300, loss[loss=0.2418, simple_loss=0.3368, pruned_loss=0.07337, over 21767.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3162, pruned_loss=0.08187, over 4265255.79 frames. ], batch size: 351, lr: 6.37e-03, grad_scale: 16.0 2023-06-21 14:31:56,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=15.0 2023-06-21 14:32:22,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=787914.0, ans=0.125 2023-06-21 14:32:26,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=787974.0, ans=0.125 2023-06-21 14:32:27,965 INFO [train.py:996] (0/4) Epoch 5, batch 9350, loss[loss=0.2588, simple_loss=0.3439, pruned_loss=0.08683, over 21842.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3219, pruned_loss=0.08264, over 4268686.58 frames. ], batch size: 118, lr: 6.37e-03, grad_scale: 16.0 2023-06-21 14:32:29,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=787974.0, ans=0.125 2023-06-21 14:32:31,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=787974.0, ans=0.125 2023-06-21 14:32:32,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-21 14:32:33,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-21 14:32:39,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=787974.0, ans=0.125 2023-06-21 14:33:20,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-21 14:33:20,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.848e+02 3.302e+02 4.167e+02 7.769e+02, threshold=6.603e+02, percent-clipped=5.0 2023-06-21 14:33:28,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=788034.0, ans=0.0 2023-06-21 14:33:48,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=788094.0, ans=0.125 2023-06-21 14:34:22,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=788214.0, ans=0.0 2023-06-21 14:34:49,352 INFO [train.py:996] (0/4) Epoch 5, batch 9400, loss[loss=0.2713, simple_loss=0.309, pruned_loss=0.1168, over 21421.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3226, pruned_loss=0.08324, over 4271339.67 frames. 
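Aside: every "[scaling.py:962] Whitening: name=... metric=X vs. limit=Y" line compares a whiteness statistic of some module's activations against a scheduled limit (a corrective gradient is typically applied only when the metric exceeds the limit). The statistic can be read as the ratio of the mean squared eigenvalue of the per-group feature covariance to the squared mean eigenvalue, which equals 1.0 exactly when the features are white. The sketch below reproduces that quantity only approximately and is not the literal k2 implementation:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels). Returns >= 1.0; equals 1.0 iff the
        # per-group feature covariance is a multiple of the identity.
        n, c = x.shape
        x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)
        cov = x.transpose(1, 2) @ x / n          # (groups, d, d) covariance
        eigs = torch.linalg.eigvalsh(cov)        # real eigenvalues, ascending
        ratio = (eigs ** 2).mean(dim=1) / eigs.mean(dim=1).clamp(min=1e-20) ** 2
        return ratio.mean().item()

    # e.g. white noise scores near 1.0, far below a limit like 15.0:
    # whitening_metric(torch.randn(1000, 256)) is approximately 1.26
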
], batch size: 510, lr: 6.37e-03, grad_scale: 16.0 2023-06-21 14:37:20,225 INFO [train.py:996] (0/4) Epoch 5, batch 9450, loss[loss=0.2058, simple_loss=0.269, pruned_loss=0.07132, over 21712.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3132, pruned_loss=0.08192, over 4270186.82 frames. ], batch size: 334, lr: 6.36e-03, grad_scale: 16.0 2023-06-21 14:37:38,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=788634.0, ans=0.125 2023-06-21 14:37:41,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-21 14:37:45,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.567e+02 2.945e+02 3.778e+02 6.288e+02, threshold=5.890e+02, percent-clipped=0.0 2023-06-21 14:38:47,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=788754.0, ans=0.1 2023-06-21 14:38:59,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=788814.0, ans=0.125 2023-06-21 14:39:23,910 INFO [train.py:996] (0/4) Epoch 5, batch 9500, loss[loss=0.2367, simple_loss=0.3031, pruned_loss=0.08519, over 21786.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.306, pruned_loss=0.07976, over 4263519.44 frames. ], batch size: 118, lr: 6.36e-03, grad_scale: 16.0 2023-06-21 14:40:11,801 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-21 14:40:31,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=788994.0, ans=0.125 2023-06-21 14:41:45,118 INFO [train.py:996] (0/4) Epoch 5, batch 9550, loss[loss=0.2833, simple_loss=0.3692, pruned_loss=0.09864, over 21622.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3103, pruned_loss=0.08203, over 4264262.29 frames. ], batch size: 389, lr: 6.36e-03, grad_scale: 16.0 2023-06-21 14:42:09,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 2.687e+02 3.290e+02 3.942e+02 9.010e+02, threshold=6.580e+02, percent-clipped=4.0 2023-06-21 14:44:03,882 INFO [train.py:996] (0/4) Epoch 5, batch 9600, loss[loss=0.2682, simple_loss=0.3841, pruned_loss=0.07617, over 20818.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3136, pruned_loss=0.08366, over 4270399.47 frames. ], batch size: 607, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:44:19,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=789534.0, ans=0.2 2023-06-21 14:46:21,980 INFO [train.py:996] (0/4) Epoch 5, batch 9650, loss[loss=0.2442, simple_loss=0.3151, pruned_loss=0.08661, over 21933.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3125, pruned_loss=0.08314, over 4276490.08 frames. 
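Aside: in each training line, loss[...] describes the current batch while tot_loss[...] is a running summary. The tot_loss frame counts hover around 4.2M-4.3M, a few hundred times a typical batch, which is consistent with an exponentially decayed frame-weighted sum rather than a plain epoch average. A sketch under that assumption; the decay constant is a guess:

    class RunningLoss:
        # Exponentially decayed, frame-weighted loss statistics; a guess at
        # how "tot_loss[loss=..., over N frames]" is maintained.
        def __init__(self, decay=1.0 - 1.0 / 200):
            self.decay = decay
            self.frames = 0.0
            self.weighted_loss = 0.0

        def update(self, batch_loss, batch_frames):
            self.frames = self.frames * self.decay + batch_frames
            self.weighted_loss = (self.weighted_loss * self.decay
                                  + batch_loss * batch_frames)
            return self.weighted_loss / self.frames, self.frames

    # steady state: frames -> batch_frames / (1 - decay), i.e. about 200
    # batches' worth, which matches the ~4.2M-frame counts logged above
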
], batch size: 316, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:46:43,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=789774.0, ans=0.125 2023-06-21 14:47:15,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.578e+02 2.899e+02 3.353e+02 5.343e+02, threshold=5.797e+02, percent-clipped=0.0 2023-06-21 14:48:25,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=789954.0, ans=0.0 2023-06-21 14:48:28,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=790014.0, ans=0.035 2023-06-21 14:48:44,246 INFO [train.py:996] (0/4) Epoch 5, batch 9700, loss[loss=0.2066, simple_loss=0.2816, pruned_loss=0.06584, over 20047.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3154, pruned_loss=0.08336, over 4277382.74 frames. ], batch size: 703, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:48:44,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=790074.0, ans=0.125 2023-06-21 14:48:45,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-21 14:48:56,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-21 14:49:56,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=790194.0, ans=0.0 2023-06-21 14:50:39,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=790314.0, ans=0.0 2023-06-21 14:50:39,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-21 14:50:41,660 INFO [train.py:996] (0/4) Epoch 5, batch 9750, loss[loss=0.1953, simple_loss=0.2544, pruned_loss=0.06804, over 21187.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3081, pruned_loss=0.08204, over 4265983.53 frames. ], batch size: 159, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:51:24,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=790434.0, ans=0.125 2023-06-21 14:51:25,939 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.430e+02 2.788e+02 3.260e+02 5.197e+02, threshold=5.575e+02, percent-clipped=0.0 2023-06-21 14:52:00,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-21 14:52:00,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=790554.0, ans=0.1 2023-06-21 14:52:03,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=790554.0, ans=0.125 2023-06-21 14:52:45,336 INFO [train.py:996] (0/4) Epoch 5, batch 9800, loss[loss=0.2125, simple_loss=0.2819, pruned_loss=0.07153, over 21660.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3087, pruned_loss=0.08216, over 4254396.67 frames. 
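Aside: the learning rate in these lines decays very slowly (6.40e-03 down to 6.36e-03 over a couple of thousand batches), as expected from a schedule that depends smoothly on both batch count and epoch. A sketch in the style of the Eden scheduler used by Zipformer recipes; the constants in the example are illustrative, not read from this run:

    def eden_lr(base_lr: float, batch: float, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
        # Smooth decay in both global batch count and (fractional) epoch;
        # lr_batches / lr_epochs set where each decay term kicks in.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # e.g. eden_lr(0.045, batch=130000, epoch=4.5) is on the order of 6e-03,
    # in line with the lr values logged above
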
], batch size: 263, lr: 6.36e-03, grad_scale: 32.0 2023-06-21 14:52:45,788 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:54:11,710 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-21 14:54:24,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=790914.0, ans=0.0 2023-06-21 14:54:44,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=790914.0, ans=0.125 2023-06-21 14:54:51,741 INFO [train.py:996] (0/4) Epoch 5, batch 9850, loss[loss=0.2153, simple_loss=0.2817, pruned_loss=0.07443, over 21717.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3058, pruned_loss=0.08243, over 4251188.44 frames. ], batch size: 112, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 14:55:07,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=790974.0, ans=0.125 2023-06-21 14:55:42,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.409e+02 2.659e+02 3.112e+02 4.458e+02, threshold=5.319e+02, percent-clipped=0.0 2023-06-21 14:55:50,232 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-21 14:56:34,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=791214.0, ans=0.125 2023-06-21 14:57:01,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=791274.0, ans=0.125 2023-06-21 14:57:02,492 INFO [train.py:996] (0/4) Epoch 5, batch 9900, loss[loss=0.2623, simple_loss=0.3403, pruned_loss=0.09216, over 19795.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3015, pruned_loss=0.08149, over 4238983.74 frames. ], batch size: 702, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 14:58:54,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=791514.0, ans=0.2 2023-06-21 14:59:03,121 INFO [train.py:996] (0/4) Epoch 5, batch 9950, loss[loss=0.2205, simple_loss=0.2796, pruned_loss=0.08068, over 21878.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3038, pruned_loss=0.08342, over 4245120.45 frames. ], batch size: 317, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 14:59:38,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=791634.0, ans=0.2 2023-06-21 15:00:12,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.669e+02 3.087e+02 3.517e+02 5.049e+02, threshold=6.174e+02, percent-clipped=0.0 2023-06-21 15:00:35,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=791694.0, ans=0.1 2023-06-21 15:01:24,098 INFO [train.py:996] (0/4) Epoch 5, batch 10000, loss[loss=0.3262, simple_loss=0.3704, pruned_loss=0.1409, over 21431.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3007, pruned_loss=0.08274, over 4257995.00 frames. 
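Aside: the "[scaling.py:182] ScheduledFloat: name=..., batch_count=..., ans=..." lines print hyper-parameters (skip rates, dropout probabilities, balancer and whitening limits) whose values are functions of the training batch count. A piecewise-linear schedule reproduces this behaviour; the breakpoints in the example are invented:

    class PiecewiseLinear:
        # Float hyper-parameter scheduled against batch_count; a sketch of
        # what the ScheduledFloat log lines evaluate ("ans" is the value).
        def __init__(self, *points):              # (batch_count, value) pairs
            self.points = sorted(points)

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

    # invented example: a skip rate annealed from 0.3 to 0.0 by 20k batches;
    # far into training it reads 0.0, as in many "ans=0.0" lines above
    conv_skip_rate = PiecewiseLinear((0.0, 0.3), (20000.0, 0.0))
    print(conv_skip_rate(780000.0))  # -> 0.0
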
], batch size: 509, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:02:25,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=791934.0, ans=0.125 2023-06-21 15:02:36,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=791994.0, ans=0.0 2023-06-21 15:02:38,047 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-132000.pt 2023-06-21 15:03:11,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=792054.0, ans=0.0 2023-06-21 15:03:13,388 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=15.0 2023-06-21 15:03:30,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=792114.0, ans=0.125 2023-06-21 15:03:34,547 INFO [train.py:996] (0/4) Epoch 5, batch 10050, loss[loss=0.1897, simple_loss=0.2657, pruned_loss=0.05688, over 21710.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3033, pruned_loss=0.08297, over 4262065.82 frames. ], batch size: 282, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:03:52,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=792174.0, ans=0.0 2023-06-21 15:04:18,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=792234.0, ans=0.05 2023-06-21 15:04:26,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=792234.0, ans=0.2 2023-06-21 15:04:41,628 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.500e+02 2.849e+02 3.392e+02 5.365e+02, threshold=5.698e+02, percent-clipped=0.0 2023-06-21 15:04:47,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-21 15:05:02,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.45 vs. limit=15.0 2023-06-21 15:05:11,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=792354.0, ans=0.0 2023-06-21 15:05:20,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=792414.0, ans=0.125 2023-06-21 15:05:32,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=792414.0, ans=0.5 2023-06-21 15:05:35,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-21 15:06:11,526 INFO [train.py:996] (0/4) Epoch 5, batch 10100, loss[loss=0.2713, simple_loss=0.3446, pruned_loss=0.09899, over 21647.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.2998, pruned_loss=0.08105, over 4263918.28 frames. ], batch size: 389, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:06:14,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. 
limit=15.0 2023-06-21 15:08:08,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=792714.0, ans=0.025 2023-06-21 15:08:30,814 INFO [train.py:996] (0/4) Epoch 5, batch 10150, loss[loss=0.2263, simple_loss=0.3022, pruned_loss=0.07517, over 21684.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3057, pruned_loss=0.0831, over 4263991.99 frames. ], batch size: 112, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:09:16,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 2.586e+02 2.981e+02 3.713e+02 5.514e+02, threshold=5.962e+02, percent-clipped=0.0 2023-06-21 15:10:13,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=792954.0, ans=0.0 2023-06-21 15:10:18,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.46 vs. limit=15.0 2023-06-21 15:10:42,455 INFO [train.py:996] (0/4) Epoch 5, batch 10200, loss[loss=0.2, simple_loss=0.2856, pruned_loss=0.0572, over 21681.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3047, pruned_loss=0.08038, over 4265489.68 frames. ], batch size: 298, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:11:12,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=793074.0, ans=0.0 2023-06-21 15:11:40,978 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:12:00,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=793254.0, ans=0.1 2023-06-21 15:12:48,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=793314.0, ans=0.125 2023-06-21 15:12:58,752 INFO [train.py:996] (0/4) Epoch 5, batch 10250, loss[loss=0.178, simple_loss=0.2719, pruned_loss=0.04205, over 21592.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2992, pruned_loss=0.07428, over 4271179.45 frames. ], batch size: 389, lr: 6.35e-03, grad_scale: 32.0 2023-06-21 15:13:38,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-21 15:13:43,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 2.089e+02 2.610e+02 3.127e+02 4.884e+02, threshold=5.220e+02, percent-clipped=0.0 2023-06-21 15:14:43,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=793614.0, ans=0.0 2023-06-21 15:15:13,521 INFO [train.py:996] (0/4) Epoch 5, batch 10300, loss[loss=0.3039, simple_loss=0.3774, pruned_loss=0.1152, over 21471.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3031, pruned_loss=0.07565, over 4279760.56 frames. ], batch size: 471, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:16:27,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=793794.0, ans=0.1 2023-06-21 15:17:41,409 INFO [train.py:996] (0/4) Epoch 5, batch 10350, loss[loss=0.1999, simple_loss=0.2747, pruned_loss=0.06259, over 21701.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3042, pruned_loss=0.07521, over 4279807.34 frames. 
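Aside: the "[checkpoint.py:75] Saving checkpoint to zipformer/exp_L_small/checkpoint-132000.pt" entry a little earlier is a batch-indexed save; 132000 is presumably the global training batch index, distinct from the per-epoch batch numbers in the train.py lines. A minimal sketch; the fields icefall actually saves (sampler state, grad scaler, averaged model, etc.) are omitted here:

    import torch

    def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                              save_every_n: int, exp_dir: str) -> None:
        # Save every save_every_n global batches under a batch-indexed name,
        # e.g. .../checkpoint-132000.pt; fields below are a minimal subset.
        if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
            return
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "batch_idx_train": batch_idx_train},
            f"{exp_dir}/checkpoint-{batch_idx_train}.pt",
        )
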
], batch size: 247, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:17:51,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=793974.0, ans=0.2 2023-06-21 15:18:08,719 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.871e+02 3.499e+02 4.355e+02 9.193e+02, threshold=6.998e+02, percent-clipped=17.0 2023-06-21 15:18:50,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=794094.0, ans=0.0 2023-06-21 15:18:59,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=794154.0, ans=0.125 2023-06-21 15:19:08,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=794154.0, ans=0.1 2023-06-21 15:19:08,313 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:19:27,456 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:19:48,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=794274.0, ans=0.125 2023-06-21 15:19:49,805 INFO [train.py:996] (0/4) Epoch 5, batch 10400, loss[loss=0.1641, simple_loss=0.2098, pruned_loss=0.0592, over 21107.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2997, pruned_loss=0.07448, over 4276041.62 frames. ], batch size: 143, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:20:03,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=794334.0, ans=0.04949747468305833 2023-06-21 15:21:50,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-21 15:22:04,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=794574.0, ans=0.1 2023-06-21 15:22:05,753 INFO [train.py:996] (0/4) Epoch 5, batch 10450, loss[loss=0.2434, simple_loss=0.3251, pruned_loss=0.08088, over 21700.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3034, pruned_loss=0.07732, over 4270335.03 frames. ], batch size: 298, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:23:13,422 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.562e+02 2.841e+02 3.622e+02 6.027e+02, threshold=5.681e+02, percent-clipped=0.0 2023-06-21 15:23:35,278 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-21 15:24:06,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=794814.0, ans=0.0 2023-06-21 15:24:22,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=794814.0, ans=0.0 2023-06-21 15:24:27,794 INFO [train.py:996] (0/4) Epoch 5, batch 10500, loss[loss=0.2321, simple_loss=0.2897, pruned_loss=0.08727, over 21764.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3034, pruned_loss=0.07711, over 4267913.88 frames. 
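Aside: throughout these lines the three reported numbers satisfy loss = 0.5 x simple_loss + pruned_loss to the printed precision (e.g. 0.5 x 0.3251 + 0.08088 = 0.24343, matching the loss=0.2434 in the batch 10450 entry above), i.e. the printed loss is a weighted sum of the simple and pruned transducer losses. A sketch of that combination; the 0.5 weight is inferred from the logged values, and any warm-up ramping of the weights is omitted:

    def transducer_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
        # The logged per-batch "loss" is consistent with this weighted sum;
        # the 0.5 scale is inferred from the printed numbers, not from code.
        return simple_loss_scale * simple_loss + pruned_loss

    # batch 10450 above: 0.5 * 0.3251 + 0.08088 = 0.24343, logged as 0.2434
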
], batch size: 112, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:25:18,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=794934.0, ans=0.125 2023-06-21 15:25:53,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=795054.0, ans=0.0 2023-06-21 15:26:36,384 INFO [train.py:996] (0/4) Epoch 5, batch 10550, loss[loss=0.2042, simple_loss=0.2564, pruned_loss=0.07604, over 21281.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2981, pruned_loss=0.0769, over 4253288.36 frames. ], batch size: 551, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:27:02,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=795234.0, ans=0.125 2023-06-21 15:27:03,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=795234.0, ans=0.0 2023-06-21 15:27:31,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.401e+02 2.781e+02 3.246e+02 4.477e+02, threshold=5.561e+02, percent-clipped=0.0 2023-06-21 15:27:31,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=795234.0, ans=0.2 2023-06-21 15:28:25,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=795414.0, ans=0.125 2023-06-21 15:28:28,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-21 15:28:36,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=795414.0, ans=0.2 2023-06-21 15:28:39,020 INFO [train.py:996] (0/4) Epoch 5, batch 10600, loss[loss=0.1959, simple_loss=0.2641, pruned_loss=0.0639, over 21901.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2942, pruned_loss=0.07579, over 4255490.70 frames. ], batch size: 98, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:30:50,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.12 vs. limit=10.0 2023-06-21 15:31:11,933 INFO [train.py:996] (0/4) Epoch 5, batch 10650, loss[loss=0.214, simple_loss=0.3123, pruned_loss=0.05789, over 21173.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2969, pruned_loss=0.07491, over 4255241.20 frames. 
], batch size: 548, lr: 6.34e-03, grad_scale: 32.0 2023-06-21 15:31:22,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=795774.0, ans=0.1 2023-06-21 15:31:35,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=795774.0, ans=0.2 2023-06-21 15:32:11,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.303e+02 2.833e+02 3.261e+02 4.754e+02, threshold=5.666e+02, percent-clipped=0.0 2023-06-21 15:32:38,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=795954.0, ans=0.1 2023-06-21 15:32:58,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=796014.0, ans=0.125 2023-06-21 15:32:58,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=796014.0, ans=0.1 2023-06-21 15:33:24,933 INFO [train.py:996] (0/4) Epoch 5, batch 10700, loss[loss=0.2319, simple_loss=0.2986, pruned_loss=0.08258, over 21712.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2948, pruned_loss=0.07436, over 4261599.64 frames. ], batch size: 247, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:34:05,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=796134.0, ans=0.05 2023-06-21 15:34:14,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=796134.0, ans=0.2 2023-06-21 15:34:15,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0 2023-06-21 15:34:48,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=796254.0, ans=0.2 2023-06-21 15:34:54,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=796254.0, ans=0.0 2023-06-21 15:35:54,017 INFO [train.py:996] (0/4) Epoch 5, batch 10750, loss[loss=0.2744, simple_loss=0.3466, pruned_loss=0.1011, over 21787.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3066, pruned_loss=0.07867, over 4263681.53 frames. ], batch size: 124, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:36:39,161 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.624e+02 2.949e+02 3.817e+02 5.681e+02, threshold=5.899e+02, percent-clipped=1.0 2023-06-21 15:36:52,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-21 15:37:08,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=796494.0, ans=0.125 2023-06-21 15:38:23,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=796614.0, ans=0.5 2023-06-21 15:38:28,030 INFO [train.py:996] (0/4) Epoch 5, batch 10800, loss[loss=0.2644, simple_loss=0.3341, pruned_loss=0.09735, over 21445.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3126, pruned_loss=0.07977, over 4263908.23 frames. 
], batch size: 194, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:38:29,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=796674.0, ans=0.125 2023-06-21 15:38:46,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-06-21 15:38:52,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=796734.0, ans=0.0 2023-06-21 15:39:05,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=796734.0, ans=0.0 2023-06-21 15:40:47,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=796974.0, ans=0.125 2023-06-21 15:40:48,668 INFO [train.py:996] (0/4) Epoch 5, batch 10850, loss[loss=0.2491, simple_loss=0.2899, pruned_loss=0.1042, over 21472.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3124, pruned_loss=0.08064, over 4262476.04 frames. ], batch size: 511, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:41:10,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=797034.0, ans=0.125 2023-06-21 15:41:17,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=797034.0, ans=0.125 2023-06-21 15:41:27,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.616e+02 2.809e+02 3.256e+02 4.598e+02, threshold=5.618e+02, percent-clipped=0.0 2023-06-21 15:41:41,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=797094.0, ans=0.05 2023-06-21 15:41:51,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.71 vs. limit=15.0 2023-06-21 15:42:26,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=797154.0, ans=0.125 2023-06-21 15:43:00,596 INFO [train.py:996] (0/4) Epoch 5, batch 10900, loss[loss=0.197, simple_loss=0.2655, pruned_loss=0.06421, over 21404.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3043, pruned_loss=0.07815, over 4258096.81 frames. ], batch size: 211, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:43:51,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=797394.0, ans=0.0 2023-06-21 15:44:01,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=797394.0, ans=0.0 2023-06-21 15:44:35,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=797454.0, ans=0.2 2023-06-21 15:45:02,683 INFO [train.py:996] (0/4) Epoch 5, batch 10950, loss[loss=0.1991, simple_loss=0.2668, pruned_loss=0.06568, over 21109.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2999, pruned_loss=0.07567, over 4263119.35 frames. 
], batch size: 143, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:45:13,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=797574.0, ans=0.1 2023-06-21 15:45:32,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=797634.0, ans=0.125 2023-06-21 15:45:44,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.495e+02 2.960e+02 3.280e+02 5.814e+02, threshold=5.920e+02, percent-clipped=2.0 2023-06-21 15:45:51,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=797694.0, ans=0.125 2023-06-21 15:45:55,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=797694.0, ans=0.0 2023-06-21 15:46:36,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=797754.0, ans=0.0 2023-06-21 15:46:36,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=797754.0, ans=10.0 2023-06-21 15:47:07,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=797814.0, ans=0.1 2023-06-21 15:47:13,637 INFO [train.py:996] (0/4) Epoch 5, batch 11000, loss[loss=0.2451, simple_loss=0.3123, pruned_loss=0.08896, over 21355.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2991, pruned_loss=0.07699, over 4260266.44 frames. ], batch size: 159, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:47:56,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=797934.0, ans=0.125 2023-06-21 15:48:07,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.12 vs. limit=10.0 2023-06-21 15:48:10,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=797934.0, ans=0.5 2023-06-21 15:49:18,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=798114.0, ans=0.2 2023-06-21 15:49:31,654 INFO [train.py:996] (0/4) Epoch 5, batch 11050, loss[loss=0.2422, simple_loss=0.362, pruned_loss=0.06123, over 20779.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2974, pruned_loss=0.07804, over 4264466.67 frames. ], batch size: 607, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:49:59,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=798234.0, ans=0.125 2023-06-21 15:50:04,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=798234.0, ans=0.2 2023-06-21 15:50:09,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.08 vs. 
limit=10.0 2023-06-21 15:50:25,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.567e+02 2.755e+02 3.188e+02 5.366e+02, threshold=5.510e+02, percent-clipped=0.0 2023-06-21 15:51:36,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=798414.0, ans=0.0 2023-06-21 15:51:42,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=798414.0, ans=0.1 2023-06-21 15:51:44,982 INFO [train.py:996] (0/4) Epoch 5, batch 11100, loss[loss=0.2101, simple_loss=0.2814, pruned_loss=0.06937, over 21621.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2959, pruned_loss=0.07794, over 4266112.59 frames. ], batch size: 298, lr: 6.33e-03, grad_scale: 32.0 2023-06-21 15:52:48,540 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-21 15:53:59,001 INFO [train.py:996] (0/4) Epoch 5, batch 11150, loss[loss=0.2168, simple_loss=0.2973, pruned_loss=0.06812, over 21247.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.294, pruned_loss=0.07709, over 4254386.43 frames. ], batch size: 143, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 15:54:02,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=798774.0, ans=0.2 2023-06-21 15:54:40,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=798834.0, ans=0.0 2023-06-21 15:54:41,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=798834.0, ans=0.1 2023-06-21 15:54:52,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.453e+02 2.788e+02 3.224e+02 5.463e+02, threshold=5.576e+02, percent-clipped=0.0 2023-06-21 15:55:04,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=798894.0, ans=0.1 2023-06-21 15:56:06,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=799014.0, ans=0.04949747468305833 2023-06-21 15:56:13,271 INFO [train.py:996] (0/4) Epoch 5, batch 11200, loss[loss=0.2108, simple_loss=0.2781, pruned_loss=0.07176, over 21483.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2922, pruned_loss=0.07664, over 4259306.51 frames. ], batch size: 389, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 15:56:30,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=799074.0, ans=0.125 2023-06-21 15:56:54,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=799134.0, ans=0.2 2023-06-21 15:57:06,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2023-06-21 15:57:28,431 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:58:25,679 INFO [train.py:996] (0/4) Epoch 5, batch 11250, loss[loss=0.223, simple_loss=0.3041, pruned_loss=0.07091, over 21653.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.293, pruned_loss=0.07712, over 4256301.96 frames. 
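The tot_loss[... over N frames] figures hover around a roughly constant frame count while the per-batch loss[...] fluctuates, which looks like a decayed, frame-weighted running average over a recent window rather than a plain epoch mean. A toy version of that bookkeeping; the decay constant is an assumption, not taken from the log:

    class RunningFrameLoss:
        """Frame-weighted running average with exponential forgetting."""
        def __init__(self, decay: float = 0.999):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> None:
            # decay old statistics, then fold in the new batch, so the
            # reported average tracks a recent stretch of training
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)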
], batch size: 391, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 15:59:15,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.425e+02 2.726e+02 3.130e+02 5.032e+02, threshold=5.452e+02, percent-clipped=0.0 2023-06-21 15:59:15,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=799434.0, ans=0.125 2023-06-21 15:59:41,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=799494.0, ans=0.1 2023-06-21 16:00:08,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=799614.0, ans=0.0 2023-06-21 16:00:34,386 INFO [train.py:996] (0/4) Epoch 5, batch 11300, loss[loss=0.2135, simple_loss=0.3022, pruned_loss=0.06239, over 21765.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2947, pruned_loss=0.07735, over 4256545.61 frames. ], batch size: 316, lr: 6.32e-03, grad_scale: 32.0 2023-06-21 16:00:36,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=799674.0, ans=0.2 2023-06-21 16:01:16,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.01 vs. limit=6.0 2023-06-21 16:01:25,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=799734.0, ans=0.125 2023-06-21 16:01:54,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=799854.0, ans=0.0 2023-06-21 16:02:30,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=799914.0, ans=0.0 2023-06-21 16:02:50,117 INFO [train.py:996] (0/4) Epoch 5, batch 11350, loss[loss=0.2057, simple_loss=0.2828, pruned_loss=0.06431, over 21273.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2968, pruned_loss=0.07678, over 4264295.67 frames. ], batch size: 143, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:03:46,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=800034.0, ans=0.125 2023-06-21 16:03:52,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.490e+02 2.826e+02 3.230e+02 4.921e+02, threshold=5.651e+02, percent-clipped=0.0 2023-06-21 16:04:52,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=800154.0, ans=0.1 2023-06-21 16:05:16,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=15.0 2023-06-21 16:05:18,316 INFO [train.py:996] (0/4) Epoch 5, batch 11400, loss[loss=0.2827, simple_loss=0.3613, pruned_loss=0.102, over 21627.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3041, pruned_loss=0.08015, over 4271722.85 frames. ], batch size: 441, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:05:40,594 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.02 vs. 
limit=15.0 2023-06-21 16:05:42,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=800274.0, ans=0.035 2023-06-21 16:06:06,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=800334.0, ans=0.2 2023-06-21 16:06:09,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=800334.0, ans=0.125 2023-06-21 16:06:34,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=800394.0, ans=0.1 2023-06-21 16:06:55,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=800394.0, ans=0.0 2023-06-21 16:06:55,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=800394.0, ans=0.0 2023-06-21 16:07:11,389 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-21 16:07:37,423 INFO [train.py:996] (0/4) Epoch 5, batch 11450, loss[loss=0.2571, simple_loss=0.3317, pruned_loss=0.09125, over 21570.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3045, pruned_loss=0.07846, over 4273697.74 frames. ], batch size: 389, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:08:18,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=800634.0, ans=0.125 2023-06-21 16:08:44,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.457e+02 2.800e+02 3.196e+02 5.475e+02, threshold=5.600e+02, percent-clipped=0.0 2023-06-21 16:09:19,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=800754.0, ans=0.0 2023-06-21 16:09:22,511 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:09:42,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-21 16:09:45,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=800814.0, ans=0.1 2023-06-21 16:09:50,524 INFO [train.py:996] (0/4) Epoch 5, batch 11500, loss[loss=0.2074, simple_loss=0.3006, pruned_loss=0.05712, over 21728.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3087, pruned_loss=0.08054, over 4280710.99 frames. ], batch size: 298, lr: 6.32e-03, grad_scale: 16.0 2023-06-21 16:10:39,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=800934.0, ans=0.2 2023-06-21 16:11:50,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=801054.0, ans=0.2 2023-06-21 16:11:59,211 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-21 16:12:26,610 INFO [train.py:996] (0/4) Epoch 5, batch 11550, loss[loss=0.2747, simple_loss=0.3721, pruned_loss=0.08864, over 21773.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3142, pruned_loss=0.08037, over 4281075.53 frames. 
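In the optim.py entries, the printed threshold is consistently Clipping_scale times the median of the grad-norm quartiles (just above: 2.0 * 2.800e+02 = 5.600e+02), so the clip point adapts to the recent distribution of gradient norms instead of being fixed. A sketch of that scheme; the history length and the simple quantile estimator are assumptions, not the recipe's exact mechanics:

    import collections
    import torch

    class QuartileGradClipper:
        """Clip to clipping_scale * median of recently observed grad norms."""
        def __init__(self, clipping_scale: float = 2.0, history: int = 1024):
            self.scale = clipping_scale
            self.norms = collections.deque(maxlen=history)

        def __call__(self, parameters) -> float:
            parameters = [p for p in parameters if p.grad is not None]
            # total grad norm without clipping (max_norm=inf makes this a no-op clip)
            norm = torch.nn.utils.clip_grad_norm_(parameters, float("inf")).item()
            self.norms.append(norm)
            ranked = sorted(self.norms)
            quartiles = [ranked[int(q * (len(ranked) - 1))]
                         for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
            threshold = self.scale * quartiles[2]   # scale * median
            if norm > threshold:
                for p in parameters:
                    p.grad.mul_(threshold / norm)
            return threshold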
], batch size: 332, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:13:41,600 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.641e+02 3.057e+02 3.432e+02 5.620e+02, threshold=6.114e+02, percent-clipped=1.0 2023-06-21 16:13:45,172 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:15:01,403 INFO [train.py:996] (0/4) Epoch 5, batch 11600, loss[loss=0.2449, simple_loss=0.3449, pruned_loss=0.07245, over 21652.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3253, pruned_loss=0.08176, over 4275312.78 frames. ], batch size: 263, lr: 6.31e-03, grad_scale: 32.0 2023-06-21 16:15:13,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=801474.0, ans=0.0 2023-06-21 16:17:18,048 INFO [train.py:996] (0/4) Epoch 5, batch 11650, loss[loss=0.2345, simple_loss=0.3149, pruned_loss=0.07705, over 21496.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3299, pruned_loss=0.08169, over 4274604.77 frames. ], batch size: 230, lr: 6.31e-03, grad_scale: 32.0 2023-06-21 16:18:27,546 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.640e+02 3.062e+02 3.776e+02 6.699e+02, threshold=6.124e+02, percent-clipped=2.0 2023-06-21 16:18:47,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=801954.0, ans=0.1 2023-06-21 16:19:09,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=801954.0, ans=0.0 2023-06-21 16:19:28,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=802014.0, ans=0.1 2023-06-21 16:19:48,622 INFO [train.py:996] (0/4) Epoch 5, batch 11700, loss[loss=0.2385, simple_loss=0.2917, pruned_loss=0.0926, over 21563.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3204, pruned_loss=0.0808, over 4270987.02 frames. ], batch size: 415, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:20:24,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=802134.0, ans=0.125 2023-06-21 16:21:59,039 INFO [train.py:996] (0/4) Epoch 5, batch 11750, loss[loss=0.2051, simple_loss=0.26, pruned_loss=0.07508, over 21613.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3115, pruned_loss=0.08077, over 4266183.29 frames. ], batch size: 264, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:22:23,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=802374.0, ans=0.035 2023-06-21 16:22:47,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.493e+02 2.846e+02 3.170e+02 4.478e+02, threshold=5.693e+02, percent-clipped=0.0 2023-06-21 16:23:08,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=802494.0, ans=0.125 2023-06-21 16:24:01,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=802614.0, ans=0.125 2023-06-21 16:24:15,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=802614.0, ans=0.125 2023-06-21 16:24:20,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.94 vs. 
limit=15.0 2023-06-21 16:24:20,700 INFO [train.py:996] (0/4) Epoch 5, batch 11800, loss[loss=0.2524, simple_loss=0.3257, pruned_loss=0.08952, over 21478.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3141, pruned_loss=0.08292, over 4259234.69 frames. ], batch size: 211, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:24:21,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=802674.0, ans=0.125 2023-06-21 16:24:30,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=802674.0, ans=0.0 2023-06-21 16:24:51,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=802734.0, ans=22.5 2023-06-21 16:25:55,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=802854.0, ans=0.2 2023-06-21 16:26:03,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=802854.0, ans=0.2 2023-06-21 16:26:37,620 INFO [train.py:996] (0/4) Epoch 5, batch 11850, loss[loss=0.2594, simple_loss=0.3367, pruned_loss=0.09102, over 21789.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3155, pruned_loss=0.08202, over 4268964.17 frames. ], batch size: 414, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:26:49,181 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=22.5 2023-06-21 16:27:03,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=803034.0, ans=0.1 2023-06-21 16:27:04,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=803034.0, ans=0.125 2023-06-21 16:27:05,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0 2023-06-21 16:27:13,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=803034.0, ans=0.125 2023-06-21 16:27:35,194 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.385e+02 2.718e+02 3.148e+02 5.334e+02, threshold=5.436e+02, percent-clipped=0.0 2023-06-21 16:27:58,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=803094.0, ans=0.125 2023-06-21 16:28:00,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=803094.0, ans=0.125 2023-06-21 16:28:00,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.41 vs. limit=10.0 2023-06-21 16:28:01,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=803094.0, ans=0.125 2023-06-21 16:28:43,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=803214.0, ans=0.0 2023-06-21 16:29:11,183 INFO [train.py:996] (0/4) Epoch 5, batch 11900, loss[loss=0.2484, simple_loss=0.3613, pruned_loss=0.06778, over 20870.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3162, pruned_loss=0.07912, over 4271256.38 frames. 
], batch size: 608, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:29:37,288 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-21 16:29:50,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=803334.0, ans=0.035 2023-06-21 16:31:10,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=803514.0, ans=0.2 2023-06-21 16:31:13,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=803514.0, ans=0.125 2023-06-21 16:31:14,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=803514.0, ans=0.0 2023-06-21 16:31:16,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=803514.0, ans=0.125 2023-06-21 16:31:26,065 INFO [train.py:996] (0/4) Epoch 5, batch 11950, loss[loss=0.2102, simple_loss=0.3086, pruned_loss=0.05586, over 21698.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3163, pruned_loss=0.07575, over 4272494.89 frames. ], batch size: 414, lr: 6.31e-03, grad_scale: 16.0 2023-06-21 16:32:21,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.301e+02 2.718e+02 3.108e+02 3.993e+02, threshold=5.436e+02, percent-clipped=0.0 2023-06-21 16:32:22,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=803694.0, ans=0.0 2023-06-21 16:32:41,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-21 16:33:39,362 INFO [train.py:996] (0/4) Epoch 5, batch 12000, loss[loss=0.2056, simple_loss=0.2743, pruned_loss=0.06845, over 21541.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3094, pruned_loss=0.07418, over 4270455.25 frames. ], batch size: 263, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:33:39,364 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 16:34:34,980 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2672, simple_loss=0.3583, pruned_loss=0.08803, over 1796401.00 frames. 2023-06-21 16:34:34,985 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB 2023-06-21 16:34:37,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=803874.0, ans=0.0 2023-06-21 16:34:59,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=803934.0, ans=0.125 2023-06-21 16:35:56,036 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:35:56,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=804054.0, ans=0.1 2023-06-21 16:35:58,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=804054.0, ans=0.0 2023-06-21 16:36:25,242 INFO [train.py:996] (0/4) Epoch 5, batch 12050, loss[loss=0.2409, simple_loss=0.3106, pruned_loss=0.08564, over 21880.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3074, pruned_loss=0.07701, over 4262873.53 frames. 
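The batch 12000 entries above pair a full validation pass with a report of peak CUDA memory ("Maximum memory allocated so far is 23752MB"). A minimal sketch of that pattern, assuming a model whose forward returns a per-frame-averaged loss together with the frame count; that interface is hypothetical:

    import torch

    @torch.no_grad()
    def compute_validation_loss(model, valid_loader, device):
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        for batch in valid_loader:
            loss, num_frames = model(batch)   # hypothetical interface
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
        model.train()
        # peak allocation since process start (or the last reset),
        # reported in MB like the "Maximum memory allocated" line above
        max_mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        return tot_loss / max(tot_frames, 1.0), max_mem_mb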
], batch size: 351, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:36:39,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=804174.0, ans=0.035 2023-06-21 16:37:26,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.690e+02 3.066e+02 3.586e+02 5.948e+02, threshold=6.132e+02, percent-clipped=2.0 2023-06-21 16:38:14,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=804354.0, ans=0.0 2023-06-21 16:38:29,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=804414.0, ans=0.125 2023-06-21 16:38:41,167 INFO [train.py:996] (0/4) Epoch 5, batch 12100, loss[loss=0.28, simple_loss=0.3432, pruned_loss=0.1084, over 21389.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3113, pruned_loss=0.08154, over 4267708.05 frames. ], batch size: 548, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:38:43,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=804474.0, ans=0.0 2023-06-21 16:39:05,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=804474.0, ans=0.125 2023-06-21 16:39:55,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=804594.0, ans=0.125 2023-06-21 16:40:52,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=804714.0, ans=0.0 2023-06-21 16:41:30,865 INFO [train.py:996] (0/4) Epoch 5, batch 12150, loss[loss=0.2356, simple_loss=0.319, pruned_loss=0.07607, over 21006.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3141, pruned_loss=0.08006, over 4265514.41 frames. ], batch size: 607, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:41:58,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=804834.0, ans=0.2 2023-06-21 16:41:59,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=804834.0, ans=0.0 2023-06-21 16:42:01,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=804834.0, ans=0.09899494936611666 2023-06-21 16:42:10,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=804834.0, ans=0.125 2023-06-21 16:42:19,168 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.609e+02 3.178e+02 3.769e+02 6.443e+02, threshold=6.356e+02, percent-clipped=2.0 2023-06-21 16:42:41,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-21 16:43:01,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=804954.0, ans=0.2 2023-06-21 16:43:14,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=805014.0, ans=0.0 2023-06-21 16:43:41,193 INFO [train.py:996] (0/4) Epoch 5, batch 12200, loss[loss=0.2068, simple_loss=0.2662, pruned_loss=0.07374, over 21510.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3122, pruned_loss=0.07914, over 4257918.33 frames. 
], batch size: 230, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:44:51,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=12.0 2023-06-21 16:44:52,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=805254.0, ans=0.1 2023-06-21 16:45:48,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-21 16:45:53,349 INFO [train.py:996] (0/4) Epoch 5, batch 12250, loss[loss=0.1624, simple_loss=0.2422, pruned_loss=0.04129, over 21533.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.304, pruned_loss=0.07584, over 4258663.70 frames. ], batch size: 212, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:46:35,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 2.451e+02 2.848e+02 3.373e+02 5.263e+02, threshold=5.696e+02, percent-clipped=0.0 2023-06-21 16:47:09,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=805554.0, ans=0.035 2023-06-21 16:47:28,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=805554.0, ans=0.125 2023-06-21 16:47:51,385 INFO [train.py:996] (0/4) Epoch 5, batch 12300, loss[loss=0.2686, simple_loss=0.3544, pruned_loss=0.09142, over 19994.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2962, pruned_loss=0.07057, over 4262339.28 frames. ], batch size: 702, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:48:35,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=805734.0, ans=0.2 2023-06-21 16:48:41,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=805734.0, ans=0.0 2023-06-21 16:49:07,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=805794.0, ans=0.1 2023-06-21 16:49:21,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-21 16:50:30,865 INFO [train.py:996] (0/4) Epoch 5, batch 12350, loss[loss=0.2375, simple_loss=0.334, pruned_loss=0.07055, over 21776.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.301, pruned_loss=0.07171, over 4272268.25 frames. ], batch size: 332, lr: 6.30e-03, grad_scale: 32.0 2023-06-21 16:51:10,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=806094.0, ans=0.125 2023-06-21 16:51:10,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 2.374e+02 2.755e+02 3.213e+02 5.680e+02, threshold=5.510e+02, percent-clipped=0.0 2023-06-21 16:51:50,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=806154.0, ans=0.2 2023-06-21 16:52:13,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-06-21 16:52:32,118 INFO [train.py:996] (0/4) Epoch 5, batch 12400, loss[loss=0.2426, simple_loss=0.3018, pruned_loss=0.09168, over 21310.00 frames. 
], tot_loss[loss=0.2273, simple_loss=0.3041, pruned_loss=0.07527, over 4279821.26 frames. ], batch size: 176, lr: 6.29e-03, grad_scale: 32.0 2023-06-21 16:54:45,664 INFO [train.py:996] (0/4) Epoch 5, batch 12450, loss[loss=0.2689, simple_loss=0.3492, pruned_loss=0.09434, over 21411.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3064, pruned_loss=0.07799, over 4280016.06 frames. ], batch size: 131, lr: 6.29e-03, grad_scale: 32.0 2023-06-21 16:54:46,712 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.10 vs. limit=15.0 2023-06-21 16:55:57,447 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.571e+02 2.934e+02 3.546e+02 5.506e+02, threshold=5.868e+02, percent-clipped=0.0 2023-06-21 16:55:59,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=806694.0, ans=0.2 2023-06-21 16:56:12,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=806694.0, ans=0.125 2023-06-21 16:56:33,967 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=22.5 2023-06-21 16:56:44,071 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:56:50,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=806814.0, ans=0.125 2023-06-21 16:57:17,118 INFO [train.py:996] (0/4) Epoch 5, batch 12500, loss[loss=0.2599, simple_loss=0.3554, pruned_loss=0.08223, over 21692.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3185, pruned_loss=0.08238, over 4281497.56 frames. ], batch size: 298, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 16:58:46,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=807054.0, ans=0.125 2023-06-21 16:59:16,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=807114.0, ans=0.0 2023-06-21 16:59:45,086 INFO [train.py:996] (0/4) Epoch 5, batch 12550, loss[loss=0.2244, simple_loss=0.3127, pruned_loss=0.06803, over 21788.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3213, pruned_loss=0.08401, over 4279760.75 frames. 
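The many ScheduledFloat entries report a value ("ans") that depends on batch_count: this late in training, skip rates sit at 0.0, dropout at 0.1, and balancer probabilities at 0.125. One natural realization is a piecewise-linear schedule over batch count; the breakpoints below are invented for illustration and are not the recipe's:

    from bisect import bisect_right

    class PiecewiseLinearFloat:
        """A float whose value is piecewise-linear in the training batch count."""
        def __init__(self, *points):
            # points: (batch_count, value) pairs
            self.points = sorted(points)

        def value(self, batch_count: float) -> float:
            xs = [b for b, _ in self.points]
            if batch_count <= xs[0]:
                return self.points[0][1]
            if batch_count >= xs[-1]:
                return self.points[-1][1]
            i = bisect_right(xs, batch_count)
            (b0, v0), (b1, v1) = self.points[i - 1], self.points[i]
            return v0 + (v1 - v0) * (batch_count - b0) / (b1 - b0)

    # e.g. a dropout that decays from 0.3 to 0.1 over the first 20k batches
    dropout_p = PiecewiseLinearFloat((0.0, 0.3), (20000.0, 0.1))
    assert dropout_p.value(807174.0) == 0.1   # far past the last breakpoint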
], batch size: 282, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 16:59:53,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=807174.0, ans=0.0 2023-06-21 16:59:54,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=807174.0, ans=0.0 2023-06-21 17:00:39,261 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.629e+02 2.998e+02 3.510e+02 7.002e+02, threshold=5.996e+02, percent-clipped=1.0 2023-06-21 17:00:42,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=807294.0, ans=0.2 2023-06-21 17:01:18,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=807354.0, ans=0.125 2023-06-21 17:01:54,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=807414.0, ans=0.1 2023-06-21 17:01:56,445 INFO [train.py:996] (0/4) Epoch 5, batch 12600, loss[loss=0.1725, simple_loss=0.2505, pruned_loss=0.04727, over 21195.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3197, pruned_loss=0.08168, over 4281417.34 frames. ], batch size: 176, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:04:10,529 INFO [train.py:996] (0/4) Epoch 5, batch 12650, loss[loss=0.2681, simple_loss=0.332, pruned_loss=0.1021, over 21864.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3123, pruned_loss=0.07834, over 4279886.49 frames. ], batch size: 118, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:05:09,545 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.417e+02 2.707e+02 3.120e+02 6.136e+02, threshold=5.414e+02, percent-clipped=1.0 2023-06-21 17:05:52,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=807954.0, ans=0.125 2023-06-21 17:05:56,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=807954.0, ans=0.05 2023-06-21 17:06:09,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=808014.0, ans=0.0 2023-06-21 17:06:10,338 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-21 17:06:18,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=808014.0, ans=0.125 2023-06-21 17:06:32,810 INFO [train.py:996] (0/4) Epoch 5, batch 12700, loss[loss=0.2619, simple_loss=0.3316, pruned_loss=0.0961, over 21475.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3122, pruned_loss=0.08085, over 4285714.73 frames. 
], batch size: 194, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:07:33,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=808194.0, ans=0.1 2023-06-21 17:08:21,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=808254.0, ans=0.125 2023-06-21 17:08:22,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=808254.0, ans=0.125 2023-06-21 17:08:41,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=808314.0, ans=0.125 2023-06-21 17:08:45,680 INFO [train.py:996] (0/4) Epoch 5, batch 12750, loss[loss=0.215, simple_loss=0.3028, pruned_loss=0.06362, over 21794.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3123, pruned_loss=0.08021, over 4286982.66 frames. ], batch size: 282, lr: 6.29e-03, grad_scale: 16.0 2023-06-21 17:09:43,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.507e+02 2.930e+02 3.517e+02 6.177e+02, threshold=5.859e+02, percent-clipped=3.0 2023-06-21 17:09:49,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-21 17:10:16,067 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2023-06-21 17:10:31,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=808554.0, ans=0.1 2023-06-21 17:10:55,726 INFO [train.py:996] (0/4) Epoch 5, batch 12800, loss[loss=0.2372, simple_loss=0.3134, pruned_loss=0.08049, over 21820.00 frames. ], tot_loss[loss=0.238, simple_loss=0.313, pruned_loss=0.0815, over 4277472.48 frames. ], batch size: 282, lr: 6.29e-03, grad_scale: 32.0 2023-06-21 17:11:42,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=808734.0, ans=0.0 2023-06-21 17:12:12,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=808794.0, ans=0.0 2023-06-21 17:12:45,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=808854.0, ans=0.125 2023-06-21 17:13:22,378 INFO [train.py:996] (0/4) Epoch 5, batch 12850, loss[loss=0.2608, simple_loss=0.3348, pruned_loss=0.09336, over 21431.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3143, pruned_loss=0.08223, over 4279650.31 frames. 
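grad_scale in these entries steps between 16.0 and 32.0 (e.g. from the batch 12750 entry to the batch 12800 entry above), the classic dynamic fp16 loss-scaling pattern: back off when gradients overflow, grow again after a run of finite steps. torch.cuda.amp.GradScaler implements exactly this behavior; the constructor arguments below are illustrative, not the recipe's settings:

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=32.0,        # starting loss scale
        backoff_factor=0.5,     # halve on inf/nan gradients -> 16.0
        growth_factor=2.0,      # double after enough clean steps -> 32.0
        growth_interval=2000,   # clean steps required before growing
    )

    # a typical fp16 step (model, optimizer, batch assumed to exist):
    #   with torch.cuda.amp.autocast():
    #       loss = model(batch)
    #   scaler.scale(loss).backward()
    #   scaler.step(optimizer)
    #   scaler.update()
    #   current_scale = scaler.get_scale()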
], batch size: 131, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:13:58,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=809034.0, ans=0.125 2023-06-21 17:14:21,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=809034.0, ans=0.2 2023-06-21 17:14:34,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 2.342e+02 2.616e+02 2.870e+02 3.698e+02, threshold=5.233e+02, percent-clipped=0.0 2023-06-21 17:14:41,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=809094.0, ans=0.125 2023-06-21 17:14:47,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=809154.0, ans=0.1 2023-06-21 17:14:59,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=12.0 2023-06-21 17:15:18,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=809214.0, ans=0.125 2023-06-21 17:15:48,055 INFO [train.py:996] (0/4) Epoch 5, batch 12900, loss[loss=0.1933, simple_loss=0.278, pruned_loss=0.05426, over 21502.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3109, pruned_loss=0.07872, over 4275529.68 frames. ], batch size: 230, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:15:54,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=809274.0, ans=0.2 2023-06-21 17:16:55,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=809394.0, ans=0.1 2023-06-21 17:16:57,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=809394.0, ans=0.125 2023-06-21 17:17:04,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.73 vs. limit=22.5 2023-06-21 17:17:09,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=809454.0, ans=0.5 2023-06-21 17:17:23,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=809454.0, ans=0.2 2023-06-21 17:17:33,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=809454.0, ans=0.05 2023-06-21 17:17:42,099 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=15.0 2023-06-21 17:18:10,370 INFO [train.py:996] (0/4) Epoch 5, batch 12950, loss[loss=0.2327, simple_loss=0.3201, pruned_loss=0.0726, over 21736.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3109, pruned_loss=0.07746, over 4272605.91 frames. ], batch size: 298, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:18:40,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. 
limit=6.0 2023-06-21 17:19:09,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-21 17:19:10,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.360e+02 2.684e+02 3.163e+02 5.049e+02, threshold=5.368e+02, percent-clipped=0.0 2023-06-21 17:20:25,657 INFO [train.py:996] (0/4) Epoch 5, batch 13000, loss[loss=0.169, simple_loss=0.2422, pruned_loss=0.04794, over 16088.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.31, pruned_loss=0.07735, over 4272184.04 frames. ], batch size: 60, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:20:59,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=809934.0, ans=0.125 2023-06-21 17:21:22,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=809934.0, ans=0.0 2023-06-21 17:22:23,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=810114.0, ans=0.1 2023-06-21 17:22:48,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=810114.0, ans=0.0 2023-06-21 17:22:56,632 INFO [train.py:996] (0/4) Epoch 5, batch 13050, loss[loss=0.2059, simple_loss=0.2799, pruned_loss=0.06597, over 21659.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3069, pruned_loss=0.07546, over 4276281.81 frames. ], batch size: 230, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:23:22,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=15.0 2023-06-21 17:23:39,773 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.464e+02 2.848e+02 3.249e+02 5.080e+02, threshold=5.696e+02, percent-clipped=0.0 2023-06-21 17:23:57,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=22.5 2023-06-21 17:24:17,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=810354.0, ans=0.035 2023-06-21 17:25:03,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=810414.0, ans=0.2 2023-06-21 17:25:16,448 INFO [train.py:996] (0/4) Epoch 5, batch 13100, loss[loss=0.2237, simple_loss=0.3086, pruned_loss=0.06942, over 21795.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3073, pruned_loss=0.07579, over 4279132.25 frames. ], batch size: 332, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:25:59,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=810534.0, ans=0.125 2023-06-21 17:26:48,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=810654.0, ans=0.125 2023-06-21 17:27:23,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=810714.0, ans=0.1 2023-06-21 17:27:37,121 INFO [train.py:996] (0/4) Epoch 5, batch 13150, loss[loss=0.2032, simple_loss=0.2763, pruned_loss=0.06501, over 21429.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3087, pruned_loss=0.07861, over 4277922.87 frames. 
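The Whitening entries compare a per-module statistic of covariance anisotropy against a scheduled limit (e.g. metric=7.00 vs. limit=15.0 for a 256-channel module above). One plausible metric with the right behavior, though not necessarily the recipe's exact formula, is mean(λ²)/mean(λ)² over the eigenvalues λ of the feature covariance: exactly 1.0 for a perfectly white signal and growing as channel variances become unbalanced:

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        """x: (num_frames, num_channels) activations from one module."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)           # real, ascending
        return float((eigs ** 2).mean() / eigs.mean() ** 2)

    x = torch.randn(4000, 256)
    print(whitening_metric(x))                                   # close to 1.0
    print(whitening_metric(x * torch.linspace(0.1, 3.0, 256)))   # noticeably larger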
], batch size: 211, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:28:39,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.725e+02 3.119e+02 3.667e+02 5.511e+02, threshold=6.238e+02, percent-clipped=0.0 2023-06-21 17:29:46,999 INFO [train.py:996] (0/4) Epoch 5, batch 13200, loss[loss=0.2347, simple_loss=0.3226, pruned_loss=0.07341, over 20109.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3074, pruned_loss=0.07862, over 4274051.80 frames. ], batch size: 702, lr: 6.28e-03, grad_scale: 32.0 2023-06-21 17:30:33,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=811134.0, ans=0.0 2023-06-21 17:31:53,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=811314.0, ans=0.125 2023-06-21 17:32:01,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=811374.0, ans=0.125 2023-06-21 17:32:01,967 INFO [train.py:996] (0/4) Epoch 5, batch 13250, loss[loss=0.212, simple_loss=0.296, pruned_loss=0.06402, over 21392.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3084, pruned_loss=0.07896, over 4274103.20 frames. ], batch size: 194, lr: 6.27e-03, grad_scale: 32.0 2023-06-21 17:33:16,248 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.617e+02 2.907e+02 3.504e+02 5.770e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-21 17:33:53,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=811554.0, ans=0.2 2023-06-21 17:34:06,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=811614.0, ans=0.125 2023-06-21 17:34:09,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=811614.0, ans=0.05 2023-06-21 17:34:30,995 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-21 17:34:31,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=811674.0, ans=0.1 2023-06-21 17:34:32,727 INFO [train.py:996] (0/4) Epoch 5, batch 13300, loss[loss=0.2595, simple_loss=0.3263, pruned_loss=0.09636, over 21751.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3119, pruned_loss=0.07888, over 4275786.89 frames. ], batch size: 441, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:34:52,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=811674.0, ans=0.1 2023-06-21 17:35:19,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=811734.0, ans=0.0 2023-06-21 17:36:05,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=811854.0, ans=0.0 2023-06-21 17:36:16,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.56 vs. limit=15.0 2023-06-21 17:36:54,456 INFO [train.py:996] (0/4) Epoch 5, batch 13350, loss[loss=0.2548, simple_loss=0.3229, pruned_loss=0.09338, over 21240.00 frames. 
], tot_loss[loss=0.2408, simple_loss=0.3175, pruned_loss=0.08202, over 4276197.05 frames. ], batch size: 159, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:37:09,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=811974.0, ans=0.125 2023-06-21 17:38:04,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=812094.0, ans=0.0 2023-06-21 17:38:06,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.647e+02 2.976e+02 3.366e+02 5.108e+02, threshold=5.953e+02, percent-clipped=0.0 2023-06-21 17:39:15,665 INFO [train.py:996] (0/4) Epoch 5, batch 13400, loss[loss=0.2306, simple_loss=0.3008, pruned_loss=0.08019, over 21918.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3192, pruned_loss=0.08445, over 4284567.75 frames. ], batch size: 124, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:39:34,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=812274.0, ans=0.0 2023-06-21 17:39:38,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.78 vs. limit=6.0 2023-06-21 17:40:37,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=812454.0, ans=0.125 2023-06-21 17:41:14,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=812514.0, ans=0.125 2023-06-21 17:41:42,770 INFO [train.py:996] (0/4) Epoch 5, batch 13450, loss[loss=0.2441, simple_loss=0.3, pruned_loss=0.09408, over 21744.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3203, pruned_loss=0.08668, over 4289653.34 frames. ], batch size: 124, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:42:31,140 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.609e+02 2.992e+02 3.362e+02 4.963e+02, threshold=5.984e+02, percent-clipped=0.0 2023-06-21 17:42:49,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=812694.0, ans=0.2 2023-06-21 17:42:50,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=812694.0, ans=0.125 2023-06-21 17:43:42,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=812814.0, ans=0.0 2023-06-21 17:43:58,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=812874.0, ans=0.0 2023-06-21 17:43:59,484 INFO [train.py:996] (0/4) Epoch 5, batch 13500, loss[loss=0.1609, simple_loss=0.2201, pruned_loss=0.05086, over 21327.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3129, pruned_loss=0.08375, over 4274234.92 frames. 
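The lr column decays slowly and smoothly through this section (6.33e-03 at batch 10850 down to 6.27e-03 by batch 13400), consistent with an inverse-power schedule in the global step and epoch rather than step-wise drops. A sketch in that family; the exponents and constants are assumptions and are not claimed to reproduce the logged values:

    def inverse_power_lr(base_lr: float, step: float, epoch: float,
                         warm_steps: float = 7500.0,
                         warm_epochs: float = 1.5) -> float:
        # roughly flat early in training, then decaying like
        # step**-0.25 * epoch**-0.5 once past the warm-up constants
        step_factor = ((step ** 2 + warm_steps ** 2) / warm_steps ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + warm_epochs ** 2) / warm_epochs ** 2) ** -0.5
        return base_lr * step_factor * epoch_factor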
], batch size: 176, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:44:55,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=812994.0, ans=0.0 2023-06-21 17:46:36,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=813174.0, ans=0.125 2023-06-21 17:46:36,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=813174.0, ans=0.0 2023-06-21 17:46:37,686 INFO [train.py:996] (0/4) Epoch 5, batch 13550, loss[loss=0.2602, simple_loss=0.3656, pruned_loss=0.07736, over 21642.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3167, pruned_loss=0.08247, over 4271457.83 frames. ], batch size: 389, lr: 6.27e-03, grad_scale: 16.0 2023-06-21 17:46:53,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-21 17:46:53,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=22.5 2023-06-21 17:47:28,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.549e+02 2.990e+02 3.504e+02 5.055e+02, threshold=5.980e+02, percent-clipped=0.0 2023-06-21 17:48:06,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=813354.0, ans=0.025 2023-06-21 17:48:45,918 INFO [train.py:996] (0/4) Epoch 5, batch 13600, loss[loss=0.2135, simple_loss=0.2845, pruned_loss=0.07128, over 21281.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3172, pruned_loss=0.08268, over 4276795.87 frames. ], batch size: 143, lr: 6.27e-03, grad_scale: 32.0 2023-06-21 17:49:58,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=813594.0, ans=0.125 2023-06-21 17:50:17,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=813654.0, ans=0.0 2023-06-21 17:50:31,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=813654.0, ans=0.125 2023-06-21 17:51:06,769 INFO [train.py:996] (0/4) Epoch 5, batch 13650, loss[loss=0.2016, simple_loss=0.2721, pruned_loss=0.06559, over 21756.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3127, pruned_loss=0.07995, over 4279886.13 frames. ], batch size: 316, lr: 6.27e-03, grad_scale: 32.0 2023-06-21 17:51:09,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-21 17:51:23,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=813834.0, ans=0.125 2023-06-21 17:51:51,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=813834.0, ans=0.125 2023-06-21 17:52:01,850 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.314e+02 2.698e+02 3.279e+02 5.824e+02, threshold=5.397e+02, percent-clipped=0.0 2023-06-21 17:52:32,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.32 vs. 
limit=22.5 2023-06-21 17:53:12,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=814014.0, ans=0.125 2023-06-21 17:53:29,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=814074.0, ans=15.0 2023-06-21 17:53:30,165 INFO [train.py:996] (0/4) Epoch 5, batch 13700, loss[loss=0.2551, simple_loss=0.3275, pruned_loss=0.0914, over 20133.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.309, pruned_loss=0.08028, over 4273060.83 frames. ], batch size: 703, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 17:53:40,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=814074.0, ans=0.125 2023-06-21 17:54:25,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=814194.0, ans=0.1 2023-06-21 17:55:39,246 INFO [train.py:996] (0/4) Epoch 5, batch 13750, loss[loss=0.1725, simple_loss=0.2435, pruned_loss=0.0507, over 21340.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.306, pruned_loss=0.0802, over 4273059.52 frames. ], batch size: 131, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 17:56:40,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.570e+02 2.907e+02 3.246e+02 5.241e+02, threshold=5.813e+02, percent-clipped=0.0 2023-06-21 17:57:28,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=814554.0, ans=0.125 2023-06-21 17:58:04,556 INFO [train.py:996] (0/4) Epoch 5, batch 13800, loss[loss=0.2856, simple_loss=0.3868, pruned_loss=0.09224, over 21675.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3103, pruned_loss=0.07959, over 4278576.50 frames. ], batch size: 389, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 17:58:20,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-21 17:58:32,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=814734.0, ans=0.0 2023-06-21 17:59:55,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=814914.0, ans=0.0 2023-06-21 18:00:36,378 INFO [train.py:996] (0/4) Epoch 5, batch 13850, loss[loss=0.2519, simple_loss=0.3253, pruned_loss=0.08927, over 21617.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3153, pruned_loss=0.08004, over 4275625.18 frames. ], batch size: 263, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:00:37,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=814974.0, ans=0.125 2023-06-21 18:01:12,133 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.72 vs. 
limit=15.0 2023-06-21 18:01:21,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.680e+02 3.025e+02 3.461e+02 6.759e+02, threshold=6.050e+02, percent-clipped=1.0 2023-06-21 18:01:55,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=815154.0, ans=0.0 2023-06-21 18:02:11,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=815154.0, ans=0.0 2023-06-21 18:02:45,815 INFO [train.py:996] (0/4) Epoch 5, batch 13900, loss[loss=0.2594, simple_loss=0.3241, pruned_loss=0.09737, over 21689.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3204, pruned_loss=0.08362, over 4274779.51 frames. ], batch size: 389, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:03:02,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=815274.0, ans=0.1 2023-06-21 18:03:20,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=815334.0, ans=0.5 2023-06-21 18:03:55,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=815394.0, ans=0.0 2023-06-21 18:04:08,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=815454.0, ans=0.125 2023-06-21 18:04:59,420 INFO [train.py:996] (0/4) Epoch 5, batch 13950, loss[loss=0.2345, simple_loss=0.3112, pruned_loss=0.07892, over 21885.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3199, pruned_loss=0.08533, over 4286293.14 frames. ], batch size: 316, lr: 6.26e-03, grad_scale: 16.0 2023-06-21 18:05:56,666 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.602e+02 2.918e+02 3.271e+02 5.546e+02, threshold=5.836e+02, percent-clipped=0.0 2023-06-21 18:06:05,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=815694.0, ans=0.125 2023-06-21 18:06:07,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=815694.0, ans=0.125 2023-06-21 18:07:06,599 INFO [train.py:996] (0/4) Epoch 5, batch 14000, loss[loss=0.2107, simple_loss=0.3017, pruned_loss=0.05989, over 21819.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3146, pruned_loss=0.0822, over 4286084.20 frames. ], batch size: 282, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:07:07,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=815874.0, ans=0.125 2023-06-21 18:07:25,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=815934.0, ans=0.0 2023-06-21 18:08:04,402 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-136000.pt 2023-06-21 18:08:09,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=815994.0, ans=0.125 2023-06-21 18:08:14,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=815994.0, ans=0.0 2023-06-21 18:08:43,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. 
limit=15.0 2023-06-21 18:09:05,308 INFO [train.py:996] (0/4) Epoch 5, batch 14050, loss[loss=0.1873, simple_loss=0.2676, pruned_loss=0.05347, over 21682.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3107, pruned_loss=0.07854, over 4289135.30 frames. ], batch size: 247, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:09:48,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=816234.0, ans=0.2 2023-06-21 18:09:51,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=816234.0, ans=0.0 2023-06-21 18:09:59,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=816294.0, ans=0.0 2023-06-21 18:10:10,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 2.220e+02 2.635e+02 3.062e+02 5.472e+02, threshold=5.269e+02, percent-clipped=0.0 2023-06-21 18:11:16,439 INFO [train.py:996] (0/4) Epoch 5, batch 14100, loss[loss=0.2623, simple_loss=0.3273, pruned_loss=0.09864, over 21531.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.304, pruned_loss=0.07791, over 4286314.30 frames. ], batch size: 389, lr: 6.26e-03, grad_scale: 32.0 2023-06-21 18:12:20,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-21 18:13:12,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=816774.0, ans=0.1 2023-06-21 18:13:18,686 INFO [train.py:996] (0/4) Epoch 5, batch 14150, loss[loss=0.2221, simple_loss=0.3044, pruned_loss=0.06991, over 21120.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.307, pruned_loss=0.07841, over 4279046.73 frames. ], batch size: 143, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:13:33,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=816834.0, ans=0.1 2023-06-21 18:13:47,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=816834.0, ans=0.125 2023-06-21 18:14:05,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.286e+02 2.751e+02 3.276e+02 5.188e+02, threshold=5.503e+02, percent-clipped=0.0 2023-06-21 18:14:25,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=816954.0, ans=0.2 2023-06-21 18:14:51,756 INFO [train.py:996] (0/4) Epoch 5, batch 14200, loss[loss=0.2114, simple_loss=0.2843, pruned_loss=0.06922, over 21649.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3067, pruned_loss=0.07745, over 4268064.58 frames. ], batch size: 230, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:15:52,693 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:16:43,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=817314.0, ans=0.0 2023-06-21 18:16:58,886 INFO [train.py:996] (0/4) Epoch 5, batch 14250, loss[loss=0.2226, simple_loss=0.2913, pruned_loss=0.07696, over 21430.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3013, pruned_loss=0.07752, over 4264291.44 frames. 
], batch size: 473, lr: 6.25e-03, grad_scale: 16.0 2023-06-21 18:17:08,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=817374.0, ans=0.2 2023-06-21 18:17:57,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 2.265e+02 2.747e+02 3.168e+02 5.793e+02, threshold=5.495e+02, percent-clipped=1.0 2023-06-21 18:18:07,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=817494.0, ans=0.125 2023-06-21 18:19:15,223 INFO [train.py:996] (0/4) Epoch 5, batch 14300, loss[loss=0.2193, simple_loss=0.3137, pruned_loss=0.06248, over 21779.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3032, pruned_loss=0.07667, over 4271376.57 frames. ], batch size: 282, lr: 6.25e-03, grad_scale: 16.0 2023-06-21 18:19:16,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-21 18:20:43,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=817854.0, ans=0.0 2023-06-21 18:21:03,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=817914.0, ans=0.2 2023-06-21 18:21:11,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=817914.0, ans=0.125 2023-06-21 18:21:34,200 INFO [train.py:996] (0/4) Epoch 5, batch 14350, loss[loss=0.1881, simple_loss=0.2552, pruned_loss=0.06051, over 21363.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3105, pruned_loss=0.07816, over 4271386.04 frames. ], batch size: 131, lr: 6.25e-03, grad_scale: 16.0 2023-06-21 18:21:56,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=817974.0, ans=0.2 2023-06-21 18:22:04,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=818034.0, ans=0.125 2023-06-21 18:22:48,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.439e+02 2.916e+02 4.284e+02 1.022e+03, threshold=5.832e+02, percent-clipped=15.0 2023-06-21 18:22:55,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=818094.0, ans=15.0 2023-06-21 18:23:46,437 INFO [train.py:996] (0/4) Epoch 5, batch 14400, loss[loss=0.2152, simple_loss=0.2772, pruned_loss=0.07658, over 21498.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3073, pruned_loss=0.07862, over 4276868.50 frames. ], batch size: 212, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:24:10,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=818274.0, ans=0.0 2023-06-21 18:24:18,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=818334.0, ans=0.2 2023-06-21 18:25:55,574 INFO [train.py:996] (0/4) Epoch 5, batch 14450, loss[loss=0.2043, simple_loss=0.2713, pruned_loss=0.06861, over 21577.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3026, pruned_loss=0.07927, over 4277762.92 frames. 
], batch size: 212, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:25:56,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=818574.0, ans=12.0 2023-06-21 18:26:28,098 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:27:01,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.873e+02 2.447e+02 2.694e+02 3.271e+02 4.968e+02, threshold=5.388e+02, percent-clipped=0.0 2023-06-21 18:27:23,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=818814.0, ans=0.125 2023-06-21 18:27:53,504 INFO [train.py:996] (0/4) Epoch 5, batch 14500, loss[loss=0.2243, simple_loss=0.3061, pruned_loss=0.07127, over 21608.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2979, pruned_loss=0.07829, over 4274875.20 frames. ], batch size: 263, lr: 6.25e-03, grad_scale: 32.0 2023-06-21 18:27:59,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=818874.0, ans=0.125 2023-06-21 18:28:28,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=818874.0, ans=0.0 2023-06-21 18:29:19,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=818994.0, ans=0.0 2023-06-21 18:29:21,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=819054.0, ans=0.2 2023-06-21 18:29:23,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-21 18:29:26,956 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:29:54,417 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:29:59,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=819114.0, ans=0.125 2023-06-21 18:30:17,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=819114.0, ans=0.02 2023-06-21 18:30:23,178 INFO [train.py:996] (0/4) Epoch 5, batch 14550, loss[loss=0.2821, simple_loss=0.3553, pruned_loss=0.1045, over 21497.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3026, pruned_loss=0.08036, over 4274316.55 frames. ], batch size: 131, lr: 6.24e-03, grad_scale: 32.0 2023-06-21 18:30:44,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=819174.0, ans=0.0 2023-06-21 18:31:33,105 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.710e+02 3.093e+02 3.463e+02 5.528e+02, threshold=6.187e+02, percent-clipped=2.0 2023-06-21 18:31:59,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=819354.0, ans=10.0 2023-06-21 18:32:38,578 INFO [train.py:996] (0/4) Epoch 5, batch 14600, loss[loss=0.2725, simple_loss=0.3456, pruned_loss=0.09968, over 21554.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3103, pruned_loss=0.0841, over 4274583.31 frames. 
], batch size: 414, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:33:24,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=819534.0, ans=0.125 2023-06-21 18:33:31,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=819534.0, ans=0.125 2023-06-21 18:33:31,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=819534.0, ans=0.1 2023-06-21 18:33:56,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-21 18:34:20,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=819654.0, ans=0.125 2023-06-21 18:34:35,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=819714.0, ans=0.07 2023-06-21 18:34:57,390 INFO [train.py:996] (0/4) Epoch 5, batch 14650, loss[loss=0.1973, simple_loss=0.2744, pruned_loss=0.06014, over 21361.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3109, pruned_loss=0.08261, over 4267946.44 frames. ], batch size: 159, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:35:40,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=819834.0, ans=0.1 2023-06-21 18:35:59,440 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 2.281e+02 2.602e+02 3.168e+02 7.024e+02, threshold=5.204e+02, percent-clipped=1.0 2023-06-21 18:36:04,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-21 18:36:16,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-21 18:36:55,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=820014.0, ans=0.0 2023-06-21 18:36:55,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=820014.0, ans=0.2 2023-06-21 18:36:56,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=820014.0, ans=0.125 2023-06-21 18:36:57,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=820014.0, ans=0.125 2023-06-21 18:37:02,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=820014.0, ans=0.0 2023-06-21 18:37:05,119 INFO [train.py:996] (0/4) Epoch 5, batch 14700, loss[loss=0.2008, simple_loss=0.2938, pruned_loss=0.05387, over 21758.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3048, pruned_loss=0.07692, over 4265461.07 frames. ], batch size: 247, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:37:28,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=820074.0, ans=0.125 2023-06-21 18:38:45,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.57 vs. 
limit=15.0 2023-06-21 18:39:36,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2023-06-21 18:39:44,755 INFO [train.py:996] (0/4) Epoch 5, batch 14750, loss[loss=0.2305, simple_loss=0.2911, pruned_loss=0.08489, over 20242.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3092, pruned_loss=0.07908, over 4267013.18 frames. ], batch size: 707, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:40:20,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=820434.0, ans=0.125 2023-06-21 18:40:41,786 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 2.601e+02 3.057e+02 3.762e+02 6.456e+02, threshold=6.114e+02, percent-clipped=6.0 2023-06-21 18:41:54,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-21 18:41:58,015 INFO [train.py:996] (0/4) Epoch 5, batch 14800, loss[loss=0.2592, simple_loss=0.3262, pruned_loss=0.09607, over 20787.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3198, pruned_loss=0.08479, over 4260278.85 frames. ], batch size: 611, lr: 6.24e-03, grad_scale: 32.0 2023-06-21 18:42:05,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=820674.0, ans=0.2 2023-06-21 18:42:06,960 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-21 18:43:08,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=820794.0, ans=0.1 2023-06-21 18:44:02,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=820914.0, ans=0.95 2023-06-21 18:44:07,364 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-21 18:44:18,460 INFO [train.py:996] (0/4) Epoch 5, batch 14850, loss[loss=0.224, simple_loss=0.2944, pruned_loss=0.07682, over 21570.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3147, pruned_loss=0.08473, over 4254569.06 frames. ], batch size: 230, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:45:46,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.574e+02 2.949e+02 3.565e+02 8.325e+02, threshold=5.898e+02, percent-clipped=1.0 2023-06-21 18:45:54,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-21 18:46:15,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=821154.0, ans=0.125 2023-06-21 18:46:29,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=821214.0, ans=0.125 2023-06-21 18:46:29,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=821214.0, ans=0.0 2023-06-21 18:46:46,434 INFO [train.py:996] (0/4) Epoch 5, batch 14900, loss[loss=0.2488, simple_loss=0.3168, pruned_loss=0.09044, over 21578.00 frames. 
], tot_loss[loss=0.2445, simple_loss=0.3168, pruned_loss=0.08609, over 4259667.45 frames. ], batch size: 263, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:46:49,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=821274.0, ans=0.125 2023-06-21 18:46:51,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=821274.0, ans=0.125 2023-06-21 18:48:05,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-21 18:48:35,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=821454.0, ans=0.0 2023-06-21 18:48:58,101 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:49:05,463 INFO [train.py:996] (0/4) Epoch 5, batch 14950, loss[loss=0.2706, simple_loss=0.3401, pruned_loss=0.1006, over 21255.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3184, pruned_loss=0.08559, over 4257572.63 frames. ], batch size: 176, lr: 6.24e-03, grad_scale: 16.0 2023-06-21 18:49:08,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=821574.0, ans=0.125 2023-06-21 18:49:21,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=821574.0, ans=0.125 2023-06-21 18:50:07,189 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:50:20,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=821694.0, ans=0.0 2023-06-21 18:50:24,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.514e+02 2.835e+02 3.523e+02 6.432e+02, threshold=5.669e+02, percent-clipped=1.0 2023-06-21 18:50:26,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.01 vs. limit=22.5 2023-06-21 18:50:40,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=821754.0, ans=0.0 2023-06-21 18:50:44,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=15.0 2023-06-21 18:51:05,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=821814.0, ans=0.2 2023-06-21 18:51:26,644 INFO [train.py:996] (0/4) Epoch 5, batch 15000, loss[loss=0.2443, simple_loss=0.3257, pruned_loss=0.08143, over 19706.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3209, pruned_loss=0.08722, over 4265797.71 frames. ], batch size: 703, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:51:26,646 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 18:52:15,532 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2599, simple_loss=0.3537, pruned_loss=0.08302, over 1796401.00 frames. 
2023-06-21 18:52:15,534 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB 2023-06-21 18:52:28,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=821874.0, ans=0.1 2023-06-21 18:53:01,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=821934.0, ans=0.2 2023-06-21 18:53:19,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. limit=10.0 2023-06-21 18:53:50,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=822114.0, ans=0.125 2023-06-21 18:54:13,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=822114.0, ans=0.0 2023-06-21 18:54:13,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=822114.0, ans=0.125 2023-06-21 18:54:13,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-21 18:54:18,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=822114.0, ans=0.0 2023-06-21 18:54:19,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=822114.0, ans=0.125 2023-06-21 18:54:21,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=822114.0, ans=0.2 2023-06-21 18:54:23,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.86 vs. limit=15.0 2023-06-21 18:54:29,552 INFO [train.py:996] (0/4) Epoch 5, batch 15050, loss[loss=0.2457, simple_loss=0.3394, pruned_loss=0.07595, over 21742.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3226, pruned_loss=0.08799, over 4255116.03 frames. ], batch size: 332, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:55:40,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=822294.0, ans=0.125 2023-06-21 18:55:41,366 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.696e+02 3.166e+02 3.893e+02 6.757e+02, threshold=6.331e+02, percent-clipped=4.0 2023-06-21 18:56:01,278 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. limit=10.0 2023-06-21 18:56:34,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-21 18:56:56,370 INFO [train.py:996] (0/4) Epoch 5, batch 15100, loss[loss=0.2336, simple_loss=0.3109, pruned_loss=0.07816, over 21803.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3255, pruned_loss=0.08771, over 4261433.37 frames. 
], batch size: 282, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:57:12,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=822474.0, ans=0.1 2023-06-21 18:57:30,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=822534.0, ans=0.2 2023-06-21 18:58:51,514 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-21 18:59:23,183 INFO [train.py:996] (0/4) Epoch 5, batch 15150, loss[loss=0.2551, simple_loss=0.3417, pruned_loss=0.0842, over 20731.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3214, pruned_loss=0.08736, over 4260053.34 frames. ], batch size: 607, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 18:59:34,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=822774.0, ans=0.1 2023-06-21 18:59:38,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=822834.0, ans=0.2 2023-06-21 18:59:59,232 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=22.5 2023-06-21 19:00:22,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.670e+02 3.218e+02 3.613e+02 4.681e+02, threshold=6.436e+02, percent-clipped=0.0 2023-06-21 19:00:37,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=822954.0, ans=0.125 2023-06-21 19:01:35,530 INFO [train.py:996] (0/4) Epoch 5, batch 15200, loss[loss=0.1968, simple_loss=0.2675, pruned_loss=0.06304, over 22016.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3111, pruned_loss=0.08269, over 4254892.52 frames. ], batch size: 103, lr: 6.23e-03, grad_scale: 32.0 2023-06-21 19:01:40,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=823074.0, ans=0.0 2023-06-21 19:01:43,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=823074.0, ans=0.0 2023-06-21 19:01:53,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=823134.0, ans=0.1 2023-06-21 19:02:49,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=823254.0, ans=0.125 2023-06-21 19:03:37,116 INFO [train.py:996] (0/4) Epoch 5, batch 15250, loss[loss=0.219, simple_loss=0.2821, pruned_loss=0.07792, over 21820.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3052, pruned_loss=0.08125, over 4260633.95 frames. 
], batch size: 317, lr: 6.23e-03, grad_scale: 32.0 2023-06-21 19:03:41,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=823374.0, ans=0.1 2023-06-21 19:04:27,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=823494.0, ans=10.0 2023-06-21 19:04:29,208 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:04:31,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=823494.0, ans=0.1 2023-06-21 19:04:36,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.376e+02 2.792e+02 3.272e+02 5.081e+02, threshold=5.584e+02, percent-clipped=0.0 2023-06-21 19:04:58,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=823554.0, ans=0.125 2023-06-21 19:05:50,672 INFO [train.py:996] (0/4) Epoch 5, batch 15300, loss[loss=0.3061, simple_loss=0.358, pruned_loss=0.1271, over 21418.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3088, pruned_loss=0.08448, over 4245830.72 frames. ], batch size: 471, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 19:06:04,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=823674.0, ans=0.125 2023-06-21 19:06:14,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-06-21 19:06:28,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-21 19:06:28,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=823734.0, ans=0.1 2023-06-21 19:06:59,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=823794.0, ans=0.1 2023-06-21 19:07:23,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=823854.0, ans=0.0 2023-06-21 19:07:30,529 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-06-21 19:07:42,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=823854.0, ans=0.5 2023-06-21 19:08:06,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=823914.0, ans=0.1 2023-06-21 19:08:11,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=823974.0, ans=0.1 2023-06-21 19:08:12,185 INFO [train.py:996] (0/4) Epoch 5, batch 15350, loss[loss=0.267, simple_loss=0.3358, pruned_loss=0.09907, over 21901.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3153, pruned_loss=0.0868, over 4255202.48 frames. 
], batch size: 371, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 19:08:39,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=824034.0, ans=0.0 2023-06-21 19:09:18,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.632e+02 3.057e+02 3.588e+02 5.490e+02, threshold=6.113e+02, percent-clipped=0.0 2023-06-21 19:09:47,243 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=22.5 2023-06-21 19:10:20,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=824214.0, ans=0.125 2023-06-21 19:10:24,825 INFO [train.py:996] (0/4) Epoch 5, batch 15400, loss[loss=0.2339, simple_loss=0.3072, pruned_loss=0.08026, over 21738.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3159, pruned_loss=0.08428, over 4245171.39 frames. ], batch size: 389, lr: 6.23e-03, grad_scale: 16.0 2023-06-21 19:10:35,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=824274.0, ans=0.1 2023-06-21 19:10:47,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=824334.0, ans=0.2 2023-06-21 19:11:08,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=824394.0, ans=0.2 2023-06-21 19:12:33,773 INFO [train.py:996] (0/4) Epoch 5, batch 15450, loss[loss=0.2105, simple_loss=0.2977, pruned_loss=0.06163, over 21616.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3136, pruned_loss=0.0835, over 4248236.64 frames. ], batch size: 263, lr: 6.22e-03, grad_scale: 16.0 2023-06-21 19:12:34,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=824574.0, ans=0.125 2023-06-21 19:12:35,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=824574.0, ans=0.125 2023-06-21 19:13:11,642 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-21 19:13:30,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.430e+02 2.746e+02 3.207e+02 5.836e+02, threshold=5.491e+02, percent-clipped=0.0 2023-06-21 19:13:31,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=824694.0, ans=0.0 2023-06-21 19:13:54,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-06-21 19:13:54,292 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-21 19:14:03,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=824754.0, ans=0.035 2023-06-21 19:14:48,582 INFO [train.py:996] (0/4) Epoch 5, batch 15500, loss[loss=0.2829, simple_loss=0.3734, pruned_loss=0.09618, over 18298.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3152, pruned_loss=0.0835, over 4254698.62 frames. 
], batch size: 60, lr: 6.22e-03, grad_scale: 16.0 2023-06-21 19:16:06,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=824994.0, ans=0.125 2023-06-21 19:16:40,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=825054.0, ans=0.015 2023-06-21 19:17:11,612 INFO [train.py:996] (0/4) Epoch 5, batch 15550, loss[loss=0.2365, simple_loss=0.308, pruned_loss=0.08253, over 21560.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3122, pruned_loss=0.08086, over 4262162.88 frames. ], batch size: 441, lr: 6.22e-03, grad_scale: 16.0 2023-06-21 19:17:17,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=825174.0, ans=0.07 2023-06-21 19:17:36,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=825234.0, ans=0.125 2023-06-21 19:17:48,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=825234.0, ans=0.0 2023-06-21 19:18:04,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=825294.0, ans=0.07 2023-06-21 19:18:22,525 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.386e+02 2.767e+02 3.218e+02 7.331e+02, threshold=5.534e+02, percent-clipped=1.0 2023-06-21 19:18:48,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=825354.0, ans=0.125 2023-06-21 19:19:21,996 INFO [train.py:996] (0/4) Epoch 5, batch 15600, loss[loss=0.2086, simple_loss=0.2759, pruned_loss=0.07061, over 21758.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3053, pruned_loss=0.07958, over 4262648.79 frames. ], batch size: 351, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:19:55,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-21 19:20:02,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.98 vs. limit=15.0 2023-06-21 19:20:06,393 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:20:11,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=825594.0, ans=0.0 2023-06-21 19:20:48,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=825654.0, ans=0.125 2023-06-21 19:21:05,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=825654.0, ans=0.125 2023-06-21 19:21:08,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=825654.0, ans=0.125 2023-06-21 19:21:31,853 INFO [train.py:996] (0/4) Epoch 5, batch 15650, loss[loss=0.2197, simple_loss=0.2889, pruned_loss=0.07527, over 21813.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.305, pruned_loss=0.07922, over 4258603.87 frames. 
], batch size: 102, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:21:47,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=825774.0, ans=0.2 2023-06-21 19:22:09,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=825834.0, ans=0.125 2023-06-21 19:22:36,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=825894.0, ans=0.125 2023-06-21 19:22:54,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.460e+02 2.777e+02 3.351e+02 5.058e+02, threshold=5.554e+02, percent-clipped=0.0 2023-06-21 19:23:29,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=826014.0, ans=0.125 2023-06-21 19:23:48,492 INFO [train.py:996] (0/4) Epoch 5, batch 15700, loss[loss=0.2061, simple_loss=0.2845, pruned_loss=0.06385, over 21534.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.302, pruned_loss=0.07811, over 4261569.53 frames. ], batch size: 230, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:24:26,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=826134.0, ans=0.5 2023-06-21 19:25:45,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=826314.0, ans=0.125 2023-06-21 19:26:06,628 INFO [train.py:996] (0/4) Epoch 5, batch 15750, loss[loss=0.2108, simple_loss=0.285, pruned_loss=0.06826, over 21180.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2976, pruned_loss=0.07741, over 4266770.11 frames. ], batch size: 143, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:26:36,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=826434.0, ans=0.0 2023-06-21 19:27:19,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.398e+02 2.705e+02 3.127e+02 4.328e+02, threshold=5.411e+02, percent-clipped=0.0 2023-06-21 19:27:41,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=826554.0, ans=0.05 2023-06-21 19:28:12,453 INFO [train.py:996] (0/4) Epoch 5, batch 15800, loss[loss=0.1958, simple_loss=0.2559, pruned_loss=0.06789, over 21266.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2935, pruned_loss=0.07691, over 4268342.40 frames. ], batch size: 176, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:28:43,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=826734.0, ans=0.125 2023-06-21 19:29:25,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=826794.0, ans=0.125 2023-06-21 19:30:04,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=826914.0, ans=0.0 2023-06-21 19:30:24,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=826974.0, ans=0.125 2023-06-21 19:30:25,370 INFO [train.py:996] (0/4) Epoch 5, batch 15850, loss[loss=0.1996, simple_loss=0.2601, pruned_loss=0.06959, over 21628.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2968, pruned_loss=0.07892, over 4257894.28 frames. 
], batch size: 282, lr: 6.22e-03, grad_scale: 32.0 2023-06-21 19:30:33,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.10 vs. limit=6.0 2023-06-21 19:30:36,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=826974.0, ans=0.125 2023-06-21 19:31:35,841 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.608e+02 3.041e+02 3.663e+02 6.488e+02, threshold=6.081e+02, percent-clipped=4.0 2023-06-21 19:31:36,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=827094.0, ans=0.125 2023-06-21 19:32:15,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.31 vs. limit=6.0 2023-06-21 19:32:19,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=827214.0, ans=0.125 2023-06-21 19:32:24,221 INFO [train.py:996] (0/4) Epoch 5, batch 15900, loss[loss=0.1861, simple_loss=0.2509, pruned_loss=0.06067, over 21375.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2976, pruned_loss=0.07943, over 4258994.52 frames. ], batch size: 211, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:32:40,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=827334.0, ans=22.5 2023-06-21 19:33:16,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.82 vs. limit=12.0 2023-06-21 19:33:24,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-21 19:34:21,043 INFO [train.py:996] (0/4) Epoch 5, batch 15950, loss[loss=0.1765, simple_loss=0.2607, pruned_loss=0.04621, over 21227.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2958, pruned_loss=0.07633, over 4263450.31 frames. ], batch size: 159, lr: 6.21e-03, grad_scale: 16.0 2023-06-21 19:35:16,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=827694.0, ans=0.0 2023-06-21 19:35:37,326 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.379e+02 2.684e+02 3.180e+02 4.998e+02, threshold=5.368e+02, percent-clipped=0.0 2023-06-21 19:35:45,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=827754.0, ans=0.0 2023-06-21 19:36:22,746 INFO [train.py:996] (0/4) Epoch 5, batch 16000, loss[loss=0.2345, simple_loss=0.3125, pruned_loss=0.07825, over 21250.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2964, pruned_loss=0.07422, over 4265879.20 frames. 
], batch size: 143, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:36:26,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=827874.0, ans=0.125 2023-06-21 19:37:47,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=828054.0, ans=0.125 2023-06-21 19:38:41,459 INFO [train.py:996] (0/4) Epoch 5, batch 16050, loss[loss=0.312, simple_loss=0.3995, pruned_loss=0.1123, over 21510.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3004, pruned_loss=0.07305, over 4275534.69 frames. ], batch size: 471, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:38:52,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-21 19:39:12,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=15.0 2023-06-21 19:39:15,878 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-21 19:39:24,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.28 vs. limit=12.0 2023-06-21 19:39:54,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.402e+02 2.679e+02 3.498e+02 5.563e+02, threshold=5.357e+02, percent-clipped=1.0 2023-06-21 19:40:15,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=828354.0, ans=0.0 2023-06-21 19:40:44,448 INFO [train.py:996] (0/4) Epoch 5, batch 16100, loss[loss=0.2303, simple_loss=0.2968, pruned_loss=0.08192, over 21881.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.303, pruned_loss=0.0741, over 4278140.91 frames. ], batch size: 316, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:41:25,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=828534.0, ans=0.125 2023-06-21 19:42:42,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=828654.0, ans=0.125 2023-06-21 19:42:52,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=828714.0, ans=0.2 2023-06-21 19:43:08,409 INFO [train.py:996] (0/4) Epoch 5, batch 16150, loss[loss=0.2776, simple_loss=0.3441, pruned_loss=0.1056, over 21568.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3027, pruned_loss=0.07676, over 4287337.63 frames. ], batch size: 471, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:44:19,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.586e+02 2.871e+02 3.411e+02 6.404e+02, threshold=5.741e+02, percent-clipped=1.0 2023-06-21 19:44:27,374 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-21 19:44:28,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-21 19:45:12,445 INFO [train.py:996] (0/4) Epoch 5, batch 16200, loss[loss=0.2188, simple_loss=0.2899, pruned_loss=0.07389, over 21587.00 frames. 
], tot_loss[loss=0.2319, simple_loss=0.3064, pruned_loss=0.07866, over 4290943.76 frames. ], batch size: 212, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:45:47,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.05 vs. limit=22.5 2023-06-21 19:45:54,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=829134.0, ans=0.125 2023-06-21 19:47:14,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=829314.0, ans=0.125 2023-06-21 19:47:19,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=829314.0, ans=0.0 2023-06-21 19:47:35,956 INFO [train.py:996] (0/4) Epoch 5, batch 16250, loss[loss=0.2446, simple_loss=0.3057, pruned_loss=0.09175, over 21455.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3075, pruned_loss=0.07922, over 4285861.72 frames. ], batch size: 509, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:47:52,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=829434.0, ans=0.125 2023-06-21 19:48:17,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=829434.0, ans=0.5 2023-06-21 19:48:54,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.259e+02 2.680e+02 3.153e+02 6.826e+02, threshold=5.361e+02, percent-clipped=1.0 2023-06-21 19:49:07,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-21 19:49:43,585 INFO [train.py:996] (0/4) Epoch 5, batch 16300, loss[loss=0.2438, simple_loss=0.3215, pruned_loss=0.08305, over 21547.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3028, pruned_loss=0.07488, over 4281745.56 frames. ], batch size: 230, lr: 6.21e-03, grad_scale: 32.0 2023-06-21 19:50:14,168 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:50:20,546 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-21 19:50:43,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-21 19:50:54,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-21 19:50:58,465 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-21 19:51:41,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=829914.0, ans=0.2 2023-06-21 19:52:05,558 INFO [train.py:996] (0/4) Epoch 5, batch 16350, loss[loss=0.2451, simple_loss=0.3579, pruned_loss=0.06613, over 20760.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3038, pruned_loss=0.07633, over 4284290.49 frames. 
], batch size: 608, lr: 6.20e-03, grad_scale: 32.0 2023-06-21 19:52:06,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=829974.0, ans=0.0 2023-06-21 19:52:27,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=829974.0, ans=0.0 2023-06-21 19:52:30,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-21 19:53:25,389 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.849e+02 2.400e+02 2.710e+02 3.275e+02 5.510e+02, threshold=5.421e+02, percent-clipped=1.0 2023-06-21 19:53:47,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=830214.0, ans=0.0 2023-06-21 19:54:20,645 INFO [train.py:996] (0/4) Epoch 5, batch 16400, loss[loss=0.2506, simple_loss=0.308, pruned_loss=0.09655, over 21552.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3086, pruned_loss=0.07766, over 4288897.22 frames. ], batch size: 548, lr: 6.20e-03, grad_scale: 32.0 2023-06-21 19:55:47,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=830454.0, ans=0.0 2023-06-21 19:56:39,866 INFO [train.py:996] (0/4) Epoch 5, batch 16450, loss[loss=0.2608, simple_loss=0.328, pruned_loss=0.09681, over 21847.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3084, pruned_loss=0.07949, over 4292962.81 frames. ], batch size: 414, lr: 6.20e-03, grad_scale: 32.0 2023-06-21 19:56:57,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=15.0 2023-06-21 19:57:45,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=830694.0, ans=0.125 2023-06-21 19:57:51,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=22.5 2023-06-21 19:57:55,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=830694.0, ans=0.125 2023-06-21 19:57:57,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.625e+02 2.903e+02 3.534e+02 6.213e+02, threshold=5.806e+02, percent-clipped=3.0 2023-06-21 19:57:59,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=830754.0, ans=0.125 2023-06-21 19:58:45,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=830814.0, ans=0.0 2023-06-21 19:58:58,127 INFO [train.py:996] (0/4) Epoch 5, batch 16500, loss[loss=0.1801, simple_loss=0.232, pruned_loss=0.06412, over 21200.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.305, pruned_loss=0.07967, over 4294003.22 frames. 
], batch size: 143, lr: 6.20e-03, grad_scale: 32.0 2023-06-21 20:00:20,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=830994.0, ans=0.0 2023-06-21 20:00:38,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=831054.0, ans=0.1 2023-06-21 20:01:20,112 INFO [train.py:996] (0/4) Epoch 5, batch 16550, loss[loss=0.3015, simple_loss=0.41, pruned_loss=0.09649, over 19846.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3005, pruned_loss=0.07646, over 4284800.88 frames. ], batch size: 702, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 20:02:01,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-21 20:02:39,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.770e+02 3.273e+02 4.136e+02 6.995e+02, threshold=6.546e+02, percent-clipped=5.0 2023-06-21 20:03:48,555 INFO [train.py:996] (0/4) Epoch 5, batch 16600, loss[loss=0.2617, simple_loss=0.367, pruned_loss=0.07819, over 21847.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3094, pruned_loss=0.08001, over 4283275.17 frames. ], batch size: 282, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 20:04:52,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=831594.0, ans=0.125 2023-06-21 20:04:59,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-21 20:05:32,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-21 20:05:58,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=831714.0, ans=0.0 2023-06-21 20:06:01,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=831714.0, ans=0.125 2023-06-21 20:06:10,108 INFO [train.py:996] (0/4) Epoch 5, batch 16650, loss[loss=0.2909, simple_loss=0.376, pruned_loss=0.1029, over 21838.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3189, pruned_loss=0.0836, over 4276707.52 frames. 
], batch size: 118, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 20:06:54,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=831834.0, ans=0.125 2023-06-21 20:07:15,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=831894.0, ans=0.125 2023-06-21 20:07:21,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.847e+02 3.289e+02 3.856e+02 6.866e+02, threshold=6.578e+02, percent-clipped=1.0 2023-06-21 20:08:00,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=831954.0, ans=0.0 2023-06-21 20:08:11,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=832014.0, ans=0.0 2023-06-21 20:08:22,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=832014.0, ans=0.125 2023-06-21 20:08:28,191 INFO [train.py:996] (0/4) Epoch 5, batch 16700, loss[loss=0.2225, simple_loss=0.3212, pruned_loss=0.06196, over 20734.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3198, pruned_loss=0.08393, over 4274492.84 frames. ], batch size: 608, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 20:08:38,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=832074.0, ans=0.125 2023-06-21 20:08:39,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-21 20:08:52,566 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:09:48,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-21 20:10:44,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=832314.0, ans=0.125 2023-06-21 20:10:56,382 INFO [train.py:996] (0/4) Epoch 5, batch 16750, loss[loss=0.3037, simple_loss=0.3874, pruned_loss=0.11, over 21468.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3228, pruned_loss=0.08479, over 4273414.47 frames. ], batch size: 471, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 20:11:43,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=832434.0, ans=0.2 2023-06-21 20:12:05,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-21 20:12:07,297 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:12:34,206 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.676e+02 3.031e+02 3.509e+02 7.132e+02, threshold=6.063e+02, percent-clipped=1.0 2023-06-21 20:13:23,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=832614.0, ans=0.1 2023-06-21 20:13:37,292 INFO [train.py:996] (0/4) Epoch 5, batch 16800, loss[loss=0.2939, simple_loss=0.3629, pruned_loss=0.1125, over 21605.00 frames. 
2023-06-21 20:14:12,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=832674.0, ans=0.1
2023-06-21 20:14:24,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=832734.0, ans=0.1
2023-06-21 20:14:39,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=832734.0, ans=0.5
2023-06-21 20:14:42,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5
2023-06-21 20:14:45,647 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.63 vs. limit=10.0
2023-06-21 20:14:47,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=832794.0, ans=0.125
2023-06-21 20:15:27,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=832914.0, ans=0.0
2023-06-21 20:15:58,069 INFO [train.py:996] (0/4) Epoch 5, batch 16850, loss[loss=0.2528, simple_loss=0.3197, pruned_loss=0.09296, over 21918.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3238, pruned_loss=0.08515, over 4277480.02 frames. ], batch size: 107, lr: 6.19e-03, grad_scale: 32.0
2023-06-21 20:16:28,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=833034.0, ans=0.5
2023-06-21 20:16:30,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=833034.0, ans=0.0
2023-06-21 20:16:33,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=833034.0, ans=0.2
2023-06-21 20:17:05,596 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.685e+02 3.021e+02 3.703e+02 6.356e+02, threshold=6.041e+02, percent-clipped=1.0
2023-06-21 20:18:07,186 INFO [train.py:996] (0/4) Epoch 5, batch 16900, loss[loss=0.1806, simple_loss=0.2504, pruned_loss=0.05539, over 21331.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3187, pruned_loss=0.0841, over 4278177.24 frames. ], batch size: 211, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:18:09,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=833274.0, ans=0.125
2023-06-21 20:19:00,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=22.5
2023-06-21 20:19:01,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=833394.0, ans=0.125
2023-06-21 20:19:40,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=833454.0, ans=0.125
2023-06-21 20:20:12,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0
2023-06-21 20:20:20,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=833514.0, ans=0.0
2023-06-21 20:20:22,854 INFO [train.py:996] (0/4) Epoch 5, batch 16950, loss[loss=0.2218, simple_loss=0.2874, pruned_loss=0.07814, over 21314.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3119, pruned_loss=0.08257, over 4281548.15 frames. ], batch size: 176, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:21:46,292 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.433e+02 2.745e+02 3.481e+02 5.788e+02, threshold=5.489e+02, percent-clipped=0.0
2023-06-21 20:22:39,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=833814.0, ans=0.125
2023-06-21 20:22:52,011 INFO [train.py:996] (0/4) Epoch 5, batch 17000, loss[loss=0.2578, simple_loss=0.3204, pruned_loss=0.0976, over 21923.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3097, pruned_loss=0.08281, over 4289614.05 frames. ], batch size: 414, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:23:01,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.42 vs. limit=15.0
2023-06-21 20:23:23,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=833934.0, ans=0.125
2023-06-21 20:23:30,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=833934.0, ans=0.125
2023-06-21 20:23:34,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=833934.0, ans=0.04949747468305833
2023-06-21 20:23:38,224 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0
2023-06-21 20:25:19,787 INFO [train.py:996] (0/4) Epoch 5, batch 17050, loss[loss=0.2397, simple_loss=0.3332, pruned_loss=0.07312, over 21853.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3154, pruned_loss=0.08428, over 4289064.58 frames. ], batch size: 316, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:25:59,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=834234.0, ans=0.125
2023-06-21 20:26:02,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5
2023-06-21 20:26:21,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=834294.0, ans=0.1
2023-06-21 20:26:25,071 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.807e+02 3.360e+02 4.026e+02 6.543e+02, threshold=6.720e+02, percent-clipped=2.0
2023-06-21 20:27:33,450 INFO [train.py:996] (0/4) Epoch 5, batch 17100, loss[loss=0.216, simple_loss=0.2825, pruned_loss=0.0748, over 21432.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.315, pruned_loss=0.08533, over 4293109.07 frames. ], batch size: 194, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:27:33,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=834474.0, ans=0.0
2023-06-21 20:28:21,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=834594.0, ans=0.125
2023-06-21 20:28:47,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=834654.0, ans=10.0
2023-06-21 20:28:59,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=834654.0, ans=0.015
2023-06-21 20:29:18,534 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 20:29:47,410 INFO [train.py:996] (0/4) Epoch 5, batch 17150, loss[loss=0.2082, simple_loss=0.2871, pruned_loss=0.06465, over 21873.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3107, pruned_loss=0.08474, over 4296707.08 frames. ], batch size: 118, lr: 6.19e-03, grad_scale: 16.0
2023-06-21 20:30:25,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=834834.0, ans=0.2
2023-06-21 20:31:03,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.351e+02 2.648e+02 3.061e+02 4.435e+02, threshold=5.296e+02, percent-clipped=0.0
2023-06-21 20:32:12,181 INFO [train.py:996] (0/4) Epoch 5, batch 17200, loss[loss=0.2373, simple_loss=0.3083, pruned_loss=0.08316, over 21881.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3097, pruned_loss=0.08441, over 4292566.04 frames. ], batch size: 371, lr: 6.19e-03, grad_scale: 32.0
2023-06-21 20:32:20,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0
2023-06-21 20:32:44,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=835134.0, ans=0.125
2023-06-21 20:32:51,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=835194.0, ans=0.1
2023-06-21 20:33:09,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=835194.0, ans=0.1
2023-06-21 20:34:22,664 INFO [train.py:996] (0/4) Epoch 5, batch 17250, loss[loss=0.2488, simple_loss=0.326, pruned_loss=0.08576, over 21827.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3133, pruned_loss=0.08595, over 4293440.30 frames. ], batch size: 247, lr: 6.18e-03, grad_scale: 32.0
2023-06-21 20:34:55,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0
2023-06-21 20:35:09,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=835434.0, ans=0.125
2023-06-21 20:35:33,321 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 20:35:51,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.701e+02 3.017e+02 3.644e+02 7.802e+02, threshold=6.033e+02, percent-clipped=3.0
2023-06-21 20:36:06,318 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0
2023-06-21 20:36:30,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5
2023-06-21 20:36:31,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=835614.0, ans=0.125
2023-06-21 20:36:42,225 INFO [train.py:996] (0/4) Epoch 5, batch 17300, loss[loss=0.2931, simple_loss=0.3624, pruned_loss=0.1119, over 21923.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3213, pruned_loss=0.08943, over 4299254.58 frames. ], batch size: 372, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:37:00,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=835674.0, ans=0.07
2023-06-21 20:37:45,702 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.94 vs. limit=22.5
2023-06-21 20:39:01,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=835974.0, ans=0.125
2023-06-21 20:39:02,692 INFO [train.py:996] (0/4) Epoch 5, batch 17350, loss[loss=0.2206, simple_loss=0.3067, pruned_loss=0.06729, over 21802.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3211, pruned_loss=0.08859, over 4288846.92 frames. ], batch size: 316, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:39:38,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=836034.0, ans=0.125
2023-06-21 20:40:49,978 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.583e+02 2.886e+02 3.231e+02 4.631e+02, threshold=5.772e+02, percent-clipped=0.0
2023-06-21 20:40:58,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=836154.0, ans=0.1
2023-06-21 20:41:27,305 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0
2023-06-21 20:41:32,782 INFO [train.py:996] (0/4) Epoch 5, batch 17400, loss[loss=0.1858, simple_loss=0.2415, pruned_loss=0.06502, over 21361.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3164, pruned_loss=0.08459, over 4279106.48 frames. ], batch size: 159, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:41:38,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=836274.0, ans=0.0
2023-06-21 20:42:39,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=836394.0, ans=0.1
2023-06-21 20:43:06,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=836454.0, ans=0.04949747468305833
2023-06-21 20:43:49,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=836514.0, ans=0.0
2023-06-21 20:44:15,064 INFO [train.py:996] (0/4) Epoch 5, batch 17450, loss[loss=0.1855, simple_loss=0.2694, pruned_loss=0.05082, over 21752.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3148, pruned_loss=0.08253, over 4280673.68 frames. ], batch size: 282, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:44:15,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=836574.0, ans=0.125
2023-06-21 20:44:57,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=836634.0, ans=0.125
2023-06-21 20:44:58,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=836634.0, ans=0.0
2023-06-21 20:45:11,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=836694.0, ans=0.1
2023-06-21 20:45:30,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.434e+02 2.809e+02 3.515e+02 5.944e+02, threshold=5.617e+02, percent-clipped=1.0
2023-06-21 20:45:56,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=836814.0, ans=0.1
2023-06-21 20:46:14,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=836814.0, ans=0.0
2023-06-21 20:46:17,024 INFO [train.py:996] (0/4) Epoch 5, batch 17500, loss[loss=0.2614, simple_loss=0.3139, pruned_loss=0.1045, over 21790.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3112, pruned_loss=0.08067, over 4282770.47 frames. ], batch size: 441, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:46:53,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=836874.0, ans=0.015
2023-06-21 20:47:18,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=836994.0, ans=0.2
2023-06-21 20:48:06,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=837114.0, ans=0.0
2023-06-21 20:48:30,509 INFO [train.py:996] (0/4) Epoch 5, batch 17550, loss[loss=0.2104, simple_loss=0.3047, pruned_loss=0.0581, over 21838.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3108, pruned_loss=0.07878, over 4285848.61 frames. ], batch size: 118, lr: 6.18e-03, grad_scale: 16.0
2023-06-21 20:49:50,421 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.518e+02 2.821e+02 3.568e+02 5.525e+02, threshold=5.643e+02, percent-clipped=0.0
2023-06-21 20:50:06,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=837414.0, ans=0.125
2023-06-21 20:50:16,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=837414.0, ans=0.05
2023-06-21 20:50:34,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=837414.0, ans=0.0
2023-06-21 20:50:45,968 INFO [train.py:996] (0/4) Epoch 5, batch 17600, loss[loss=0.2643, simple_loss=0.3315, pruned_loss=0.09859, over 21405.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3125, pruned_loss=0.07892, over 4272613.49 frames. ], batch size: 549, lr: 6.18e-03, grad_scale: 32.0
2023-06-21 20:51:10,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=837534.0, ans=0.125
2023-06-21 20:51:33,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=837594.0, ans=0.0
2023-06-21 20:51:39,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=837594.0, ans=0.125
2023-06-21 20:51:40,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=837594.0, ans=0.2
2023-06-21 20:52:33,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=837714.0, ans=0.125
2023-06-21 20:52:42,137 INFO [train.py:996] (0/4) Epoch 5, batch 17650, loss[loss=0.2103, simple_loss=0.2916, pruned_loss=0.06452, over 21693.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3092, pruned_loss=0.07897, over 4274346.97 frames. ], batch size: 415, lr: 6.18e-03, grad_scale: 32.0
2023-06-21 20:53:30,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=837834.0, ans=0.2
2023-06-21 20:54:08,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.478e+02 2.944e+02 3.611e+02 6.334e+02, threshold=5.887e+02, percent-clipped=3.0
2023-06-21 20:54:11,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=837954.0, ans=0.0
2023-06-21 20:54:14,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=837954.0, ans=0.125
2023-06-21 20:54:22,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=837954.0, ans=0.0
2023-06-21 20:54:51,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0
2023-06-21 20:54:56,283 INFO [train.py:996] (0/4) Epoch 5, batch 17700, loss[loss=0.2198, simple_loss=0.299, pruned_loss=0.07029, over 21276.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3055, pruned_loss=0.07668, over 4274066.39 frames. ], batch size: 176, lr: 6.17e-03, grad_scale: 32.0
2023-06-21 20:55:21,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=838134.0, ans=0.035
2023-06-21 20:55:26,457 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0
2023-06-21 20:55:53,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=838194.0, ans=0.125
2023-06-21 20:55:59,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=838194.0, ans=0.125
2023-06-21 20:57:30,184 INFO [train.py:996] (0/4) Epoch 5, batch 17750, loss[loss=0.2573, simple_loss=0.3368, pruned_loss=0.08895, over 21677.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3143, pruned_loss=0.08059, over 4278857.07 frames. ], batch size: 351, lr: 6.17e-03, grad_scale: 16.0
2023-06-21 20:57:46,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=838434.0, ans=0.125
2023-06-21 20:58:57,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.544e+02 3.029e+02 3.560e+02 6.664e+02, threshold=6.058e+02, percent-clipped=1.0
2023-06-21 20:59:22,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=838614.0, ans=0.125
2023-06-21 20:59:49,790 INFO [train.py:996] (0/4) Epoch 5, batch 17800, loss[loss=0.2425, simple_loss=0.317, pruned_loss=0.08398, over 20111.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3133, pruned_loss=0.07964, over 4269308.56 frames. ], batch size: 702, lr: 6.17e-03, grad_scale: 16.0
2023-06-21 21:01:58,529 INFO [train.py:996] (0/4) Epoch 5, batch 17850, loss[loss=0.2554, simple_loss=0.3222, pruned_loss=0.09429, over 21731.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3131, pruned_loss=0.08015, over 4270422.46 frames. ], batch size: 298, lr: 6.17e-03, grad_scale: 16.0
2023-06-21 21:02:35,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=839034.0, ans=0.1
2023-06-21 21:03:43,554 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.616e+02 2.932e+02 3.373e+02 4.756e+02, threshold=5.865e+02, percent-clipped=0.0
2023-06-21 21:04:31,037 INFO [train.py:996] (0/4) Epoch 5, batch 17900, loss[loss=0.2523, simple_loss=0.3421, pruned_loss=0.08131, over 21709.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.317, pruned_loss=0.08132, over 4272679.75 frames. ], batch size: 298, lr: 6.17e-03, grad_scale: 16.0
2023-06-21 21:05:05,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=839334.0, ans=0.1
2023-06-21 21:05:07,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=839334.0, ans=0.125
2023-06-21 21:06:06,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=839394.0, ans=0.2
2023-06-21 21:06:18,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=12.0
2023-06-21 21:07:04,357 INFO [train.py:996] (0/4) Epoch 5, batch 17950, loss[loss=0.2351, simple_loss=0.3406, pruned_loss=0.06483, over 21192.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3185, pruned_loss=0.07906, over 4270234.74 frames. ], batch size: 548, lr: 6.17e-03, grad_scale: 16.0
2023-06-21 21:07:27,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=22.5
2023-06-21 21:08:30,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 2.221e+02 2.558e+02 3.024e+02 5.288e+02, threshold=5.115e+02, percent-clipped=0.0
2023-06-21 21:08:30,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=839754.0, ans=10.0
2023-06-21 21:09:17,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=839874.0, ans=0.05
2023-06-21 21:09:18,118 INFO [train.py:996] (0/4) Epoch 5, batch 18000, loss[loss=0.2114, simple_loss=0.2751, pruned_loss=0.07387, over 21748.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.311, pruned_loss=0.07709, over 4269424.37 frames. ], batch size: 317, lr: 6.17e-03, grad_scale: 32.0
2023-06-21 21:09:18,119 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-21 21:10:07,761 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2683, simple_loss=0.365, pruned_loss=0.08582, over 1796401.00 frames.
2023-06-21 21:10:07,763 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB
2023-06-21 21:10:26,663 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0
2023-06-21 21:10:52,926 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-140000.pt
2023-06-21 21:11:02,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=839994.0, ans=0.125
2023-06-21 21:11:25,799 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 21:12:14,554 INFO [train.py:996] (0/4) Epoch 5, batch 18050, loss[loss=0.2157, simple_loss=0.2829, pruned_loss=0.07426, over 21655.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3055, pruned_loss=0.07661, over 4271635.50 frames. ], batch size: 247, lr: 6.17e-03, grad_scale: 32.0
2023-06-21 21:12:39,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=840174.0, ans=0.125
2023-06-21 21:13:00,692 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 21:13:37,666 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.366e+02 2.744e+02 3.197e+02 4.866e+02, threshold=5.489e+02, percent-clipped=0.0
2023-06-21 21:14:06,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=840354.0, ans=0.125
2023-06-21 21:14:24,270 INFO [train.py:996] (0/4) Epoch 5, batch 18100, loss[loss=0.2221, simple_loss=0.3253, pruned_loss=0.0594, over 21555.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3102, pruned_loss=0.07865, over 4263514.24 frames. ], batch size: 230, lr: 6.17e-03, grad_scale: 32.0
2023-06-21 21:16:32,820 INFO [train.py:996] (0/4) Epoch 5, batch 18150, loss[loss=0.2454, simple_loss=0.3079, pruned_loss=0.09147, over 21590.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3113, pruned_loss=0.07903, over 4267118.59 frames. ], batch size: 414, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:17:15,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=840894.0, ans=0.0
2023-06-21 21:17:16,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=840894.0, ans=0.0
2023-06-21 21:17:52,206 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.463e+02 2.724e+02 3.140e+02 4.706e+02, threshold=5.448e+02, percent-clipped=0.0
2023-06-21 21:17:55,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=840954.0, ans=0.0
2023-06-21 21:18:22,752 INFO [train.py:996] (0/4) Epoch 5, batch 18200, loss[loss=0.2004, simple_loss=0.2759, pruned_loss=0.06249, over 21784.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3075, pruned_loss=0.07896, over 4254663.35 frames. ], batch size: 102, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:19:16,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=841194.0, ans=0.125
2023-06-21 21:19:35,309 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5
2023-06-21 21:20:10,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=841314.0, ans=0.2
2023-06-21 21:20:11,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=841314.0, ans=0.0
2023-06-21 21:20:27,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.79 vs. limit=6.0
2023-06-21 21:20:28,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=841314.0, ans=0.1
2023-06-21 21:20:36,979 INFO [train.py:996] (0/4) Epoch 5, batch 18250, loss[loss=0.184, simple_loss=0.2533, pruned_loss=0.05732, over 21664.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3003, pruned_loss=0.07575, over 4250683.38 frames. ], batch size: 298, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:20:43,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=841374.0, ans=0.125
2023-06-21 21:21:26,248 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 21:21:26,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=841434.0, ans=0.125
2023-06-21 21:21:44,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=841494.0, ans=0.0
2023-06-21 21:22:05,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 2.405e+02 2.752e+02 3.303e+02 7.064e+02, threshold=5.504e+02, percent-clipped=4.0
2023-06-21 21:22:07,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=841554.0, ans=0.125
2023-06-21 21:22:37,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=841614.0, ans=0.125
2023-06-21 21:22:40,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0
2023-06-21 21:22:41,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=841614.0, ans=0.035
2023-06-21 21:22:43,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=841614.0, ans=0.0
2023-06-21 21:22:45,505 INFO [train.py:996] (0/4) Epoch 5, batch 18300, loss[loss=0.2839, simple_loss=0.361, pruned_loss=0.1035, over 21711.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3015, pruned_loss=0.07659, over 4254909.93 frames. ], batch size: 441, lr: 6.16e-03, grad_scale: 16.0
2023-06-21 21:23:14,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=841734.0, ans=0.125
2023-06-21 21:23:46,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5
2023-06-21 21:24:07,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.41 vs. limit=10.0
2023-06-21 21:24:44,378 INFO [train.py:996] (0/4) Epoch 5, batch 18350, loss[loss=0.2244, simple_loss=0.3036, pruned_loss=0.07259, over 21252.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3054, pruned_loss=0.07655, over 4253521.96 frames. ], batch size: 548, lr: 6.16e-03, grad_scale: 16.0
2023-06-21 21:25:07,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=841974.0, ans=10.0
2023-06-21 21:26:05,182 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.529e+02 3.098e+02 3.873e+02 6.507e+02, threshold=6.195e+02, percent-clipped=4.0
2023-06-21 21:27:08,524 INFO [train.py:996] (0/4) Epoch 5, batch 18400, loss[loss=0.2163, simple_loss=0.2808, pruned_loss=0.07591, over 21188.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3007, pruned_loss=0.07532, over 4254437.46 frames. ], batch size: 143, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:27:21,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=842274.0, ans=0.125
2023-06-21 21:27:52,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=842334.0, ans=0.125
2023-06-21 21:28:03,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=842394.0, ans=0.2
2023-06-21 21:28:05,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0
2023-06-21 21:28:47,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0
2023-06-21 21:29:21,807 INFO [train.py:996] (0/4) Epoch 5, batch 18450, loss[loss=0.2125, simple_loss=0.3005, pruned_loss=0.06222, over 21596.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2989, pruned_loss=0.07253, over 4259868.90 frames. ], batch size: 442, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:29:50,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=842634.0, ans=0.0
2023-06-21 21:29:56,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=842634.0, ans=0.1
2023-06-21 21:29:57,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=842634.0, ans=0.05
2023-06-21 21:30:35,670 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.187e+02 2.513e+02 3.081e+02 4.801e+02, threshold=5.026e+02, percent-clipped=0.0
2023-06-21 21:30:55,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=842814.0, ans=0.0
2023-06-21 21:31:16,296 INFO [train.py:996] (0/4) Epoch 5, batch 18500, loss[loss=0.1853, simple_loss=0.2559, pruned_loss=0.05737, over 21513.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2935, pruned_loss=0.07138, over 4254694.05 frames. ], batch size: 230, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:31:57,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=842874.0, ans=0.0
2023-06-21 21:31:57,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=842874.0, ans=0.1
2023-06-21 21:32:11,660 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 21:32:21,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=842934.0, ans=0.0
2023-06-21 21:32:34,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0
2023-06-21 21:32:36,874 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5
2023-06-21 21:33:15,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=843114.0, ans=0.0
2023-06-21 21:33:29,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=843114.0, ans=0.0
2023-06-21 21:33:38,482 INFO [train.py:996] (0/4) Epoch 5, batch 18550, loss[loss=0.214, simple_loss=0.2825, pruned_loss=0.07279, over 21929.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.291, pruned_loss=0.07025, over 4254614.03 frames. ], batch size: 113, lr: 6.16e-03, grad_scale: 32.0
2023-06-21 21:33:59,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=843174.0, ans=0.0
2023-06-21 21:34:01,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=843174.0, ans=0.0
2023-06-21 21:34:29,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=843294.0, ans=0.125
2023-06-21 21:35:03,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.298e+02 2.531e+02 2.916e+02 4.382e+02, threshold=5.063e+02, percent-clipped=0.0
2023-06-21 21:35:14,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=843354.0, ans=0.2
2023-06-21 21:35:24,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0
2023-06-21 21:35:30,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=843414.0, ans=0.1
2023-06-21 21:35:39,102 INFO [train.py:996] (0/4) Epoch 5, batch 18600, loss[loss=0.2211, simple_loss=0.2879, pruned_loss=0.07722, over 15464.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2886, pruned_loss=0.0699, over 4253762.51 frames. ], batch size: 60, lr: 6.15e-03, grad_scale: 16.0
2023-06-21 21:35:54,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=843474.0, ans=0.025
2023-06-21 21:37:31,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.58 vs. limit=22.5
2023-06-21 21:37:38,648 INFO [train.py:996] (0/4) Epoch 5, batch 18650, loss[loss=0.221, simple_loss=0.3051, pruned_loss=0.06845, over 20785.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2888, pruned_loss=0.07077, over 4251643.86 frames. ], batch size: 608, lr: 6.15e-03, grad_scale: 16.0
2023-06-21 21:38:32,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=843834.0, ans=0.125
2023-06-21 21:39:03,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=843954.0, ans=0.125
2023-06-21 21:39:04,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.384e+02 2.688e+02 3.191e+02 3.999e+02, threshold=5.375e+02, percent-clipped=0.0
2023-06-21 21:39:11,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0
2023-06-21 21:39:12,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=843954.0, ans=0.125
2023-06-21 21:39:50,080 INFO [train.py:996] (0/4) Epoch 5, batch 18700, loss[loss=0.217, simple_loss=0.2847, pruned_loss=0.07468, over 14516.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2868, pruned_loss=0.07208, over 4257346.69 frames. ], batch size: 60, lr: 6.15e-03, grad_scale: 16.0
2023-06-21 21:41:05,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.99 vs. limit=15.0
2023-06-21 21:41:29,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=844254.0, ans=0.0
2023-06-21 21:41:36,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=844314.0, ans=0.125
2023-06-21 21:41:57,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5
2023-06-21 21:42:03,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.55 vs. limit=15.0
2023-06-21 21:42:07,676 INFO [train.py:996] (0/4) Epoch 5, batch 18750, loss[loss=0.2067, simple_loss=0.277, pruned_loss=0.0682, over 21290.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2904, pruned_loss=0.07504, over 4267133.51 frames. ], batch size: 176, lr: 6.15e-03, grad_scale: 16.0
2023-06-21 21:42:36,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. limit=10.0
2023-06-21 21:43:06,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=12.0
2023-06-21 21:43:21,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=844494.0, ans=0.0
2023-06-21 21:43:41,301 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 2.502e+02 2.880e+02 3.663e+02 5.516e+02, threshold=5.761e+02, percent-clipped=2.0
2023-06-21 21:43:46,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=844554.0, ans=0.125
2023-06-21 21:44:30,214 INFO [train.py:996] (0/4) Epoch 5, batch 18800, loss[loss=0.2765, simple_loss=0.3562, pruned_loss=0.09835, over 20797.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2976, pruned_loss=0.07786, over 4259434.98 frames. ], batch size: 607, lr: 6.15e-03, grad_scale: 32.0
2023-06-21 21:45:17,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=844794.0, ans=0.125
2023-06-21 21:46:08,197 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.19 vs. limit=12.0
2023-06-21 21:46:09,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=844914.0, ans=0.2
2023-06-21 21:46:32,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.71 vs. limit=12.0
2023-06-21 21:46:32,353 INFO [train.py:996] (0/4) Epoch 5, batch 18850, loss[loss=0.1965, simple_loss=0.2652, pruned_loss=0.06392, over 21504.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2937, pruned_loss=0.07388, over 4255650.30 frames. ], batch size: 230, lr: 6.15e-03, grad_scale: 32.0
2023-06-21 21:47:02,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=844974.0, ans=0.125
2023-06-21 21:47:44,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=845094.0, ans=0.1
2023-06-21 21:48:09,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 2.173e+02 2.548e+02 3.090e+02 5.403e+02, threshold=5.096e+02, percent-clipped=0.0
2023-06-21 21:48:16,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0
2023-06-21 21:48:47,485 INFO [train.py:996] (0/4) Epoch 5, batch 18900, loss[loss=0.2273, simple_loss=0.2903, pruned_loss=0.08212, over 21221.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2902, pruned_loss=0.07277, over 4256623.42 frames. ], batch size: 159, lr: 6.15e-03, grad_scale: 32.0
2023-06-21 21:48:56,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=845274.0, ans=0.125
2023-06-21 21:50:10,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0
2023-06-21 21:50:36,668 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.58 vs. limit=6.0
2023-06-21 21:50:50,447 INFO [train.py:996] (0/4) Epoch 5, batch 18950, loss[loss=0.2707, simple_loss=0.3708, pruned_loss=0.08524, over 21736.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.291, pruned_loss=0.07483, over 4269740.06 frames. ], batch size: 414, lr: 6.15e-03, grad_scale: 32.0
2023-06-21 21:51:31,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=845634.0, ans=0.1
2023-06-21 21:51:49,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0
2023-06-21 21:52:52,184 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.414e+02 2.698e+02 3.115e+02 5.010e+02, threshold=5.395e+02, percent-clipped=0.0
2023-06-21 21:53:21,635 INFO [train.py:996] (0/4) Epoch 5, batch 19000, loss[loss=0.2665, simple_loss=0.3582, pruned_loss=0.08737, over 21706.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2991, pruned_loss=0.07582, over 4271377.92 frames. ], batch size: 416, lr: 6.15e-03, grad_scale: 32.0
2023-06-21 21:53:56,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.42 vs. limit=10.0
2023-06-21 21:54:08,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=845994.0, ans=0.5
2023-06-21 21:54:59,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5
2023-06-21 21:55:29,394 INFO [train.py:996] (0/4) Epoch 5, batch 19050, loss[loss=0.2367, simple_loss=0.3004, pruned_loss=0.08654, over 21941.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3043, pruned_loss=0.07956, over 4280575.15 frames. ], batch size: 316, lr: 6.14e-03, grad_scale: 32.0
2023-06-21 21:55:40,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0
2023-06-21 21:56:39,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.56 vs. limit=15.0
2023-06-21 21:56:57,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=846354.0, ans=0.125
2023-06-21 21:57:00,288 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.693e+02 3.057e+02 3.649e+02 5.381e+02, threshold=6.114e+02, percent-clipped=0.0
2023-06-21 21:57:43,809 INFO [train.py:996] (0/4) Epoch 5, batch 19100, loss[loss=0.2179, simple_loss=0.283, pruned_loss=0.07642, over 21812.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3028, pruned_loss=0.08064, over 4275217.56 frames. ], batch size: 371, lr: 6.14e-03, grad_scale: 16.0
2023-06-21 21:58:07,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=846474.0, ans=0.125
2023-06-21 21:58:31,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=846594.0, ans=15.0
2023-06-21 21:58:55,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=846594.0, ans=0.1
2023-06-21 21:59:19,314 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 21:59:19,854 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0
2023-06-21 21:59:55,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.96 vs. limit=10.0
2023-06-21 22:00:10,246 INFO [train.py:996] (0/4) Epoch 5, batch 19150, loss[loss=0.2232, simple_loss=0.3026, pruned_loss=0.07185, over 21372.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.303, pruned_loss=0.08086, over 4272053.01 frames. ], batch size: 194, lr: 6.14e-03, grad_scale: 16.0
2023-06-21 22:01:42,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.638e+02 2.878e+02 3.184e+02 5.125e+02, threshold=5.755e+02, percent-clipped=0.0
2023-06-21 22:01:54,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=847014.0, ans=0.0
2023-06-21 22:02:02,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=847014.0, ans=10.0
2023-06-21 22:02:22,313 INFO [train.py:996] (0/4) Epoch 5, batch 19200, loss[loss=0.2517, simple_loss=0.3493, pruned_loss=0.07704, over 21790.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3133, pruned_loss=0.08134, over 4272634.28 frames. ], batch size: 332, lr: 6.14e-03, grad_scale: 32.0
2023-06-21 22:02:28,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=847074.0, ans=0.0
2023-06-21 22:03:28,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=847194.0, ans=0.0
2023-06-21 22:03:59,273 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0
2023-06-21 22:04:02,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.37 vs. limit=22.5
2023-06-21 22:04:18,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5
2023-06-21 22:04:20,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=847314.0, ans=0.125
2023-06-21 22:04:42,045 INFO [train.py:996] (0/4) Epoch 5, batch 19250, loss[loss=0.1782, simple_loss=0.2569, pruned_loss=0.04979, over 21427.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3109, pruned_loss=0.07511, over 4278655.26 frames. ], batch size: 131, lr: 6.14e-03, grad_scale: 32.0
2023-06-21 22:04:56,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=847434.0, ans=0.0
2023-06-21 22:05:21,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=847434.0, ans=0.0
2023-06-21 22:05:47,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=847494.0, ans=0.0
2023-06-21 22:06:30,476 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.396e+02 2.706e+02 3.299e+02 5.125e+02, threshold=5.412e+02, percent-clipped=0.0
2023-06-21 22:06:39,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=847614.0, ans=0.125
2023-06-21 22:06:51,712 INFO [train.py:996] (0/4) Epoch 5, batch 19300, loss[loss=0.1765, simple_loss=0.2784, pruned_loss=0.03733, over 20827.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3092, pruned_loss=0.07477, over 4281263.85 frames. ], batch size: 608, lr: 6.14e-03, grad_scale: 32.0
2023-06-21 22:07:43,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=847794.0, ans=0.2
2023-06-21 22:09:03,671 INFO [train.py:996] (0/4) Epoch 5, batch 19350, loss[loss=0.1892, simple_loss=0.2755, pruned_loss=0.0514, over 21747.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3047, pruned_loss=0.07162, over 4285272.59 frames. ], batch size: 282, lr: 6.14e-03, grad_scale: 32.0
2023-06-21 22:10:27,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=848154.0, ans=0.0
2023-06-21 22:10:49,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.180e+02 2.502e+02 2.789e+02 3.930e+02, threshold=5.004e+02, percent-clipped=0.0
2023-06-21 22:10:49,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=848154.0, ans=0.125
2023-06-21 22:11:01,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=848214.0, ans=0.2
2023-06-21 22:11:12,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=848214.0, ans=0.125
2023-06-21 22:11:16,308 INFO [train.py:996] (0/4) Epoch 5, batch 19400, loss[loss=0.2087, simple_loss=0.2816, pruned_loss=0.06787, over 21639.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.302, pruned_loss=0.07089, over 4289170.35 frames. ], batch size: 263, lr: 6.14e-03, grad_scale: 32.0
2023-06-21 22:11:29,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=848274.0, ans=0.2
2023-06-21 22:11:29,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=848274.0, ans=0.125
2023-06-21 22:11:44,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=848334.0, ans=0.035
2023-06-21 22:12:16,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=848394.0, ans=0.0
2023-06-21 22:13:26,776 INFO [train.py:996] (0/4) Epoch 5, batch 19450, loss[loss=0.2268, simple_loss=0.285, pruned_loss=0.08425, over 21859.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2997, pruned_loss=0.07342, over 4296347.91 frames. ], batch size: 373, lr: 6.14e-03, grad_scale: 32.0
2023-06-21 22:13:49,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=848574.0, ans=0.2
2023-06-21 22:14:26,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5
2023-06-21 22:15:03,933 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.755e+02 3.312e+02 4.071e+02 7.217e+02, threshold=6.625e+02, percent-clipped=11.0
2023-06-21 22:15:21,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0
2023-06-21 22:15:31,721 INFO [train.py:996] (0/4) Epoch 5, batch 19500, loss[loss=0.1903, simple_loss=0.2535, pruned_loss=0.0635, over 21234.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2956, pruned_loss=0.07407, over 4298140.95 frames. ], batch size: 159, lr: 6.14e-03, grad_scale: 32.0
2023-06-21 22:16:39,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=848994.0, ans=0.125
2023-06-21 22:17:02,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=848994.0, ans=0.0
2023-06-21 22:17:47,561 INFO [train.py:996] (0/4) Epoch 5, batch 19550, loss[loss=0.2565, simple_loss=0.3075, pruned_loss=0.1027, over 20035.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2905, pruned_loss=0.07267, over 4287070.17 frames. ], batch size: 702, lr: 6.13e-03, grad_scale: 16.0
2023-06-21 22:18:17,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=849234.0, ans=0.125
2023-06-21 22:18:42,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.36 vs. limit=15.0
2023-06-21 22:19:24,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0
2023-06-21 22:19:29,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.481e+02 2.782e+02 3.315e+02 6.873e+02, threshold=5.565e+02, percent-clipped=1.0
2023-06-21 22:19:40,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=849414.0, ans=15.0
2023-06-21 22:19:54,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=849414.0, ans=0.09899494936611666
2023-06-21 22:19:55,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=849414.0, ans=0.125
2023-06-21 22:19:59,648 INFO [train.py:996] (0/4) Epoch 5, batch 19600, loss[loss=0.2094, simple_loss=0.2736, pruned_loss=0.0726, over 21171.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2928, pruned_loss=0.07341, over 4293503.53 frames. ], batch size: 608, lr: 6.13e-03, grad_scale: 32.0
2023-06-21 22:21:52,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=849714.0, ans=0.125
2023-06-21 22:22:29,163 INFO [train.py:996] (0/4) Epoch 5, batch 19650, loss[loss=0.281, simple_loss=0.3337, pruned_loss=0.1141, over 21552.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2995, pruned_loss=0.07806, over 4294800.26 frames. ], batch size: 471, lr: 6.13e-03, grad_scale: 32.0
], batch size: 471, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:23:00,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=849834.0, ans=0.2 2023-06-21 22:23:37,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=849894.0, ans=0.125 2023-06-21 22:23:40,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=849894.0, ans=0.125 2023-06-21 22:24:12,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.670e+02 2.999e+02 3.349e+02 5.989e+02, threshold=5.997e+02, percent-clipped=1.0 2023-06-21 22:24:30,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=850014.0, ans=0.0 2023-06-21 22:24:57,777 INFO [train.py:996] (0/4) Epoch 5, batch 19700, loss[loss=0.2702, simple_loss=0.3575, pruned_loss=0.09144, over 21479.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3033, pruned_loss=0.07939, over 4285035.38 frames. ], batch size: 471, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:26:39,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=850254.0, ans=0.125 2023-06-21 22:27:22,336 INFO [train.py:996] (0/4) Epoch 5, batch 19750, loss[loss=0.2754, simple_loss=0.3603, pruned_loss=0.09525, over 21699.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3119, pruned_loss=0.08027, over 4280460.67 frames. ], batch size: 389, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:28:09,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=850434.0, ans=0.1 2023-06-21 22:28:13,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=850434.0, ans=0.2 2023-06-21 22:29:11,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=850554.0, ans=0.125 2023-06-21 22:29:19,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.869e+02 3.746e+02 5.027e+02 9.459e+02, threshold=7.491e+02, percent-clipped=12.0 2023-06-21 22:29:33,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=850614.0, ans=0.125 2023-06-21 22:29:39,220 INFO [train.py:996] (0/4) Epoch 5, batch 19800, loss[loss=0.2112, simple_loss=0.2867, pruned_loss=0.06782, over 21796.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3118, pruned_loss=0.0803, over 4288080.95 frames. 
], batch size: 112, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:29:42,753 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:29:51,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=850674.0, ans=0.125 2023-06-21 22:30:35,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=850794.0, ans=0.125 2023-06-21 22:30:57,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=850794.0, ans=0.0 2023-06-21 22:31:24,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=850854.0, ans=0.0 2023-06-21 22:32:11,663 INFO [train.py:996] (0/4) Epoch 5, batch 19850, loss[loss=0.2174, simple_loss=0.2979, pruned_loss=0.06846, over 21461.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3043, pruned_loss=0.07558, over 4288327.39 frames. ], batch size: 194, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 22:32:24,878 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-21 22:32:35,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=851034.0, ans=0.1 2023-06-21 22:32:36,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=851034.0, ans=0.125 2023-06-21 22:32:47,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=851034.0, ans=0.1 2023-06-21 22:33:06,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-21 22:33:37,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.738e+02 2.104e+02 2.344e+02 2.838e+02 4.436e+02, threshold=4.687e+02, percent-clipped=0.0 2023-06-21 22:34:25,315 INFO [train.py:996] (0/4) Epoch 5, batch 19900, loss[loss=0.2013, simple_loss=0.2797, pruned_loss=0.06149, over 21601.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3043, pruned_loss=0.07304, over 4282536.59 frames. ], batch size: 247, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 22:34:30,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=851274.0, ans=0.5 2023-06-21 22:34:34,202 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-21 22:35:02,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=851394.0, ans=0.125 2023-06-21 22:35:44,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. 
limit=15.0 2023-06-21 22:35:52,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=851514.0, ans=0.125 2023-06-21 22:36:28,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=851574.0, ans=0.125 2023-06-21 22:36:29,307 INFO [train.py:996] (0/4) Epoch 5, batch 19950, loss[loss=0.2191, simple_loss=0.3023, pruned_loss=0.06799, over 20698.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2988, pruned_loss=0.07277, over 4281120.56 frames. ], batch size: 607, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 22:37:20,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=851694.0, ans=0.0 2023-06-21 22:38:22,467 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.543e+02 2.962e+02 3.700e+02 5.630e+02, threshold=5.923e+02, percent-clipped=5.0 2023-06-21 22:38:35,698 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.01 vs. limit=12.0 2023-06-21 22:38:36,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=851814.0, ans=0.04949747468305833 2023-06-21 22:38:51,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=851874.0, ans=0.2 2023-06-21 22:38:52,203 INFO [train.py:996] (0/4) Epoch 5, batch 20000, loss[loss=0.2374, simple_loss=0.3085, pruned_loss=0.08319, over 21511.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2994, pruned_loss=0.07304, over 4284998.07 frames. ], batch size: 548, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 22:39:16,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-21 22:39:31,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=851994.0, ans=0.0 2023-06-21 22:40:05,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=852054.0, ans=0.125 2023-06-21 22:40:17,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-21 22:40:31,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=852114.0, ans=0.5 2023-06-21 22:40:42,189 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:40:53,538 INFO [train.py:996] (0/4) Epoch 5, batch 20050, loss[loss=0.2468, simple_loss=0.3144, pruned_loss=0.08956, over 21857.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3005, pruned_loss=0.07503, over 4279173.73 frames. ], batch size: 351, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 22:41:20,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.18 vs. 
limit=12.0 2023-06-21 22:41:29,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=852234.0, ans=0.0 2023-06-21 22:41:31,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=852234.0, ans=0.09899494936611666 2023-06-21 22:42:24,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.589e+02 2.869e+02 3.352e+02 4.558e+02, threshold=5.737e+02, percent-clipped=0.0 2023-06-21 22:42:47,660 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=22.5 2023-06-21 22:43:07,085 INFO [train.py:996] (0/4) Epoch 5, batch 20100, loss[loss=0.2523, simple_loss=0.347, pruned_loss=0.07881, over 21783.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3027, pruned_loss=0.07715, over 4287400.54 frames. ], batch size: 282, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:43:54,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=852534.0, ans=0.0 2023-06-21 22:43:54,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=852534.0, ans=0.0 2023-06-21 22:44:24,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=852594.0, ans=0.09899494936611666 2023-06-21 22:44:54,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=852654.0, ans=0.125 2023-06-21 22:45:34,426 INFO [train.py:996] (0/4) Epoch 5, batch 20150, loss[loss=0.2719, simple_loss=0.3366, pruned_loss=0.1037, over 21373.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3145, pruned_loss=0.08158, over 4288067.86 frames. ], batch size: 549, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:46:15,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=852834.0, ans=0.0 2023-06-21 22:46:23,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=852894.0, ans=0.125 2023-06-21 22:46:59,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=852954.0, ans=0.125 2023-06-21 22:47:18,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.899e+02 3.546e+02 4.462e+02 7.672e+02, threshold=7.092e+02, percent-clipped=5.0 2023-06-21 22:47:53,461 INFO [train.py:996] (0/4) Epoch 5, batch 20200, loss[loss=0.2408, simple_loss=0.3192, pruned_loss=0.08124, over 21655.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3208, pruned_loss=0.08541, over 4286991.56 frames. ], batch size: 263, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:48:10,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=853074.0, ans=10.0 2023-06-21 22:48:54,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=853134.0, ans=0.125 2023-06-21 22:49:05,233 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. 
limit=10.0 2023-06-21 22:49:08,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=853194.0, ans=0.125 2023-06-21 22:49:12,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=853194.0, ans=0.025 2023-06-21 22:50:22,418 INFO [train.py:996] (0/4) Epoch 5, batch 20250, loss[loss=0.2157, simple_loss=0.2942, pruned_loss=0.06865, over 21660.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3211, pruned_loss=0.08373, over 4291820.81 frames. ], batch size: 230, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:51:07,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=853494.0, ans=0.0 2023-06-21 22:51:56,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.435e+02 2.804e+02 3.299e+02 5.136e+02, threshold=5.609e+02, percent-clipped=0.0 2023-06-21 22:52:26,421 INFO [train.py:996] (0/4) Epoch 5, batch 20300, loss[loss=0.2127, simple_loss=0.2989, pruned_loss=0.06331, over 21357.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3182, pruned_loss=0.0802, over 4290585.54 frames. ], batch size: 211, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:52:31,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=853674.0, ans=0.0 2023-06-21 22:53:13,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-21 22:54:19,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=853974.0, ans=0.035 2023-06-21 22:54:20,955 INFO [train.py:996] (0/4) Epoch 5, batch 20350, loss[loss=0.2517, simple_loss=0.3202, pruned_loss=0.09158, over 21616.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3181, pruned_loss=0.08057, over 4278433.67 frames. ], batch size: 263, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 22:55:40,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=854094.0, ans=0.125 2023-06-21 22:55:52,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=854094.0, ans=0.125 2023-06-21 22:55:54,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=854154.0, ans=0.125 2023-06-21 22:56:06,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=854154.0, ans=0.0 2023-06-21 22:56:09,301 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.362e+02 2.783e+02 3.388e+02 6.347e+02, threshold=5.566e+02, percent-clipped=2.0 2023-06-21 22:56:17,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-21 22:56:42,089 INFO [train.py:996] (0/4) Epoch 5, batch 20400, loss[loss=0.1912, simple_loss=0.2663, pruned_loss=0.05806, over 17250.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3207, pruned_loss=0.0834, over 4261832.99 frames. 
], batch size: 62, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 22:58:01,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=854454.0, ans=0.125 2023-06-21 22:58:16,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=854454.0, ans=0.0 2023-06-21 22:58:21,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.46 vs. limit=8.0 2023-06-21 22:58:57,666 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.22 vs. limit=6.0 2023-06-21 22:59:03,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-21 22:59:03,938 INFO [train.py:996] (0/4) Epoch 5, batch 20450, loss[loss=0.2467, simple_loss=0.3121, pruned_loss=0.09061, over 21827.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3217, pruned_loss=0.08659, over 4261468.13 frames. ], batch size: 414, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 22:59:08,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=854574.0, ans=0.1 2023-06-21 22:59:10,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=854574.0, ans=0.0 2023-06-21 22:59:23,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=854634.0, ans=0.125 2023-06-21 22:59:45,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.56 vs. limit=15.0 2023-06-21 22:59:57,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=854694.0, ans=0.125 2023-06-21 23:00:12,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=854694.0, ans=0.2 2023-06-21 23:00:34,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=854814.0, ans=0.125 2023-06-21 23:00:35,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.525e+02 2.860e+02 3.466e+02 5.175e+02, threshold=5.721e+02, percent-clipped=0.0 2023-06-21 23:00:55,762 INFO [train.py:996] (0/4) Epoch 5, batch 20500, loss[loss=0.2275, simple_loss=0.2864, pruned_loss=0.08423, over 21336.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3173, pruned_loss=0.08617, over 4259921.77 frames. ], batch size: 144, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:01:53,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.19 vs. limit=22.5 2023-06-21 23:02:40,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=855114.0, ans=0.125 2023-06-21 23:03:05,511 INFO [train.py:996] (0/4) Epoch 5, batch 20550, loss[loss=0.1908, simple_loss=0.2706, pruned_loss=0.05551, over 17001.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3099, pruned_loss=0.08456, over 4259639.41 frames. 
], batch size: 62, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:03:07,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=855174.0, ans=0.125 2023-06-21 23:04:10,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=855294.0, ans=6.0 2023-06-21 23:04:27,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=855294.0, ans=0.0 2023-06-21 23:04:41,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=855354.0, ans=0.0 2023-06-21 23:04:41,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-21 23:04:45,343 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.439e+02 2.815e+02 3.369e+02 5.545e+02, threshold=5.629e+02, percent-clipped=0.0 2023-06-21 23:05:06,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=855414.0, ans=0.125 2023-06-21 23:05:11,325 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-21 23:05:16,246 INFO [train.py:996] (0/4) Epoch 5, batch 20600, loss[loss=0.2657, simple_loss=0.3309, pruned_loss=0.1003, over 21856.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3127, pruned_loss=0.08227, over 4254588.24 frames. ], batch size: 371, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:06:30,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=855654.0, ans=0.125 2023-06-21 23:07:15,658 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-21 23:07:21,862 INFO [train.py:996] (0/4) Epoch 5, batch 20650, loss[loss=0.2114, simple_loss=0.2741, pruned_loss=0.07433, over 21815.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3089, pruned_loss=0.0825, over 4249331.28 frames. ], batch size: 351, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:08:41,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.84 vs. limit=15.0 2023-06-21 23:08:48,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=855954.0, ans=0.2 2023-06-21 23:08:49,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=855954.0, ans=0.125 2023-06-21 23:09:13,971 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.383e+02 2.744e+02 3.352e+02 4.969e+02, threshold=5.489e+02, percent-clipped=0.0 2023-06-21 23:09:17,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=856014.0, ans=0.125 2023-06-21 23:09:44,581 INFO [train.py:996] (0/4) Epoch 5, batch 20700, loss[loss=0.2901, simple_loss=0.3697, pruned_loss=0.1053, over 21483.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3023, pruned_loss=0.07935, over 4261754.17 frames. 
], batch size: 471, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:09:54,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=856074.0, ans=0.125 2023-06-21 23:10:34,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=856194.0, ans=0.125 2023-06-21 23:10:46,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=856254.0, ans=0.125 2023-06-21 23:11:04,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=856254.0, ans=0.0 2023-06-21 23:11:42,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=856314.0, ans=0.125 2023-06-21 23:11:49,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=856314.0, ans=0.025 2023-06-21 23:11:51,748 INFO [train.py:996] (0/4) Epoch 5, batch 20750, loss[loss=0.2564, simple_loss=0.3597, pruned_loss=0.07658, over 21743.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3052, pruned_loss=0.07892, over 4265105.54 frames. ], batch size: 351, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:12:51,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=856494.0, ans=0.1 2023-06-21 23:13:19,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=856554.0, ans=0.0 2023-06-21 23:13:40,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=856554.0, ans=0.125 2023-06-21 23:13:46,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.828e+02 3.457e+02 4.836e+02 8.710e+02, threshold=6.913e+02, percent-clipped=21.0 2023-06-21 23:14:13,081 INFO [train.py:996] (0/4) Epoch 5, batch 20800, loss[loss=0.2267, simple_loss=0.2888, pruned_loss=0.08229, over 21829.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3088, pruned_loss=0.07995, over 4268284.48 frames. ], batch size: 352, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 23:14:32,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=856734.0, ans=0.125 2023-06-21 23:15:11,586 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-21 23:15:25,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=856854.0, ans=0.125 2023-06-21 23:16:02,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=856914.0, ans=0.0 2023-06-21 23:16:24,394 INFO [train.py:996] (0/4) Epoch 5, batch 20850, loss[loss=0.2154, simple_loss=0.2833, pruned_loss=0.07379, over 21851.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3009, pruned_loss=0.07776, over 4265480.30 frames. 
], batch size: 98, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 23:18:13,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.467e+02 2.845e+02 3.460e+02 6.189e+02, threshold=5.691e+02, percent-clipped=0.0 2023-06-21 23:18:40,145 INFO [train.py:996] (0/4) Epoch 5, batch 20900, loss[loss=0.211, simple_loss=0.2901, pruned_loss=0.06596, over 21351.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3013, pruned_loss=0.07864, over 4270932.17 frames. ], batch size: 131, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 23:19:15,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=857394.0, ans=0.125 2023-06-21 23:20:11,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-21 23:20:25,861 INFO [train.py:996] (0/4) Epoch 5, batch 20950, loss[loss=0.2219, simple_loss=0.2831, pruned_loss=0.08036, over 21456.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2987, pruned_loss=0.07609, over 4270867.92 frames. ], batch size: 471, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:20:43,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=857574.0, ans=0.5 2023-06-21 23:21:07,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=857634.0, ans=0.0 2023-06-21 23:21:17,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=857694.0, ans=0.0 2023-06-21 23:21:55,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=857754.0, ans=0.125 2023-06-21 23:22:08,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.456e+02 2.757e+02 3.195e+02 6.346e+02, threshold=5.513e+02, percent-clipped=1.0 2023-06-21 23:22:22,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=857814.0, ans=0.125 2023-06-21 23:22:39,128 INFO [train.py:996] (0/4) Epoch 5, batch 21000, loss[loss=0.2071, simple_loss=0.2754, pruned_loss=0.06941, over 21394.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2961, pruned_loss=0.07563, over 4266673.73 frames. ], batch size: 159, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:22:39,129 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 23:23:36,534 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2652, simple_loss=0.3651, pruned_loss=0.08266, over 1796401.00 frames. 2023-06-21 23:23:36,535 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB 2023-06-21 23:23:42,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=857874.0, ans=0.0 2023-06-21 23:25:03,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=858114.0, ans=0.125 2023-06-21 23:25:06,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=858114.0, ans=0.2 2023-06-21 23:25:24,733 INFO [train.py:996] (0/4) Epoch 5, batch 21050, loss[loss=0.2196, simple_loss=0.296, pruned_loss=0.07157, over 21594.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2945, pruned_loss=0.07606, over 4274597.02 frames. 
], batch size: 414, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:25:41,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=858174.0, ans=0.2 2023-06-21 23:25:54,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=858234.0, ans=0.1 2023-06-21 23:26:53,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=858354.0, ans=0.2 2023-06-21 23:26:59,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.489e+02 2.963e+02 3.543e+02 5.648e+02, threshold=5.927e+02, percent-clipped=1.0 2023-06-21 23:27:09,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=858414.0, ans=0.0 2023-06-21 23:27:10,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=858414.0, ans=0.0 2023-06-21 23:27:34,815 INFO [train.py:996] (0/4) Epoch 5, batch 21100, loss[loss=0.2273, simple_loss=0.293, pruned_loss=0.08074, over 21913.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2914, pruned_loss=0.07569, over 4268546.94 frames. ], batch size: 107, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:27:41,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=858474.0, ans=0.1 2023-06-21 23:27:42,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=858474.0, ans=0.0 2023-06-21 23:27:57,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=858474.0, ans=0.1 2023-06-21 23:28:17,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=858594.0, ans=0.125 2023-06-21 23:29:43,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=858774.0, ans=0.0 2023-06-21 23:29:44,530 INFO [train.py:996] (0/4) Epoch 5, batch 21150, loss[loss=0.2016, simple_loss=0.2671, pruned_loss=0.06803, over 21092.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2866, pruned_loss=0.07514, over 4269282.23 frames. ], batch size: 176, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:30:22,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=858834.0, ans=0.125 2023-06-21 23:31:29,330 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.447e+02 2.797e+02 3.334e+02 4.948e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-21 23:31:55,140 INFO [train.py:996] (0/4) Epoch 5, batch 21200, loss[loss=0.2079, simple_loss=0.2504, pruned_loss=0.08266, over 20271.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2828, pruned_loss=0.07435, over 4265328.35 frames. ], batch size: 703, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 23:32:06,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. 
limit=15.0 2023-06-21 23:33:11,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=859254.0, ans=0.04949747468305833 2023-06-21 23:33:12,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=22.5 2023-06-21 23:33:17,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=859314.0, ans=0.125 2023-06-21 23:34:07,352 INFO [train.py:996] (0/4) Epoch 5, batch 21250, loss[loss=0.2177, simple_loss=0.2812, pruned_loss=0.07714, over 21173.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2813, pruned_loss=0.0747, over 4265089.80 frames. ], batch size: 143, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:34:07,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=859374.0, ans=0.2 2023-06-21 23:34:15,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=859374.0, ans=0.0 2023-06-21 23:34:19,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=859374.0, ans=0.2 2023-06-21 23:34:28,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=859434.0, ans=0.0 2023-06-21 23:34:28,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=859434.0, ans=0.2 2023-06-21 23:34:56,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=859494.0, ans=0.125 2023-06-21 23:35:38,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=859614.0, ans=0.0 2023-06-21 23:35:41,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.492e+02 2.834e+02 3.212e+02 4.793e+02, threshold=5.668e+02, percent-clipped=0.0 2023-06-21 23:36:01,187 INFO [train.py:996] (0/4) Epoch 5, batch 21300, loss[loss=0.2516, simple_loss=0.3185, pruned_loss=0.09239, over 21878.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2881, pruned_loss=0.07733, over 4268282.99 frames. ], batch size: 414, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:36:36,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=859734.0, ans=0.1 2023-06-21 23:36:44,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-06-21 23:36:54,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-21 23:37:34,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=859854.0, ans=0.1 2023-06-21 23:37:34,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=859854.0, ans=0.0 2023-06-21 23:38:27,907 INFO [train.py:996] (0/4) Epoch 5, batch 21350, loss[loss=0.232, simple_loss=0.3229, pruned_loss=0.07059, over 21268.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2915, pruned_loss=0.07733, over 4263114.68 frames. 
], batch size: 548, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 23:38:37,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.18 vs. limit=6.0 2023-06-21 23:38:42,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=860034.0, ans=0.2 2023-06-21 23:38:54,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=860034.0, ans=0.125 2023-06-21 23:39:16,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=860094.0, ans=0.125 2023-06-21 23:39:56,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=860214.0, ans=0.125 2023-06-21 23:39:58,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.438e+02 2.773e+02 3.263e+02 5.694e+02, threshold=5.547e+02, percent-clipped=1.0 2023-06-21 23:40:07,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=860214.0, ans=0.125 2023-06-21 23:40:26,318 INFO [train.py:996] (0/4) Epoch 5, batch 21400, loss[loss=0.2884, simple_loss=0.3557, pruned_loss=0.1105, over 21772.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2945, pruned_loss=0.07607, over 4265998.71 frames. ], batch size: 441, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:40:34,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-21 23:40:34,297 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-21 23:40:49,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=860334.0, ans=0.0 2023-06-21 23:41:01,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=860334.0, ans=0.2 2023-06-21 23:41:31,327 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-21 23:41:32,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-06-21 23:41:59,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=860454.0, ans=0.0 2023-06-21 23:42:45,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=860514.0, ans=0.2 2023-06-21 23:42:50,929 INFO [train.py:996] (0/4) Epoch 5, batch 21450, loss[loss=0.2381, simple_loss=0.3085, pruned_loss=0.08384, over 21885.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.299, pruned_loss=0.07828, over 4274697.50 frames. 
], batch size: 107, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:43:26,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=860634.0, ans=0.1 2023-06-21 23:43:29,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=860634.0, ans=0.0 2023-06-21 23:43:48,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=860694.0, ans=10.0 2023-06-21 23:44:23,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-21 23:44:24,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=860754.0, ans=0.0 2023-06-21 23:44:40,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.786e+02 3.330e+02 3.846e+02 5.703e+02, threshold=6.661e+02, percent-clipped=1.0 2023-06-21 23:44:41,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. limit=10.0 2023-06-21 23:45:04,752 INFO [train.py:996] (0/4) Epoch 5, batch 21500, loss[loss=0.264, simple_loss=0.351, pruned_loss=0.08845, over 19861.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2976, pruned_loss=0.07949, over 4264017.54 frames. ], batch size: 703, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:45:48,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=860934.0, ans=0.0 2023-06-21 23:46:09,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=860994.0, ans=0.1 2023-06-21 23:47:17,731 INFO [train.py:996] (0/4) Epoch 5, batch 21550, loss[loss=0.1995, simple_loss=0.2665, pruned_loss=0.06626, over 21596.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2916, pruned_loss=0.07752, over 4256080.21 frames. ], batch size: 415, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 23:47:24,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=861174.0, ans=0.2 2023-06-21 23:49:12,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.449e+02 2.734e+02 3.185e+02 5.518e+02, threshold=5.467e+02, percent-clipped=0.0 2023-06-21 23:49:13,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=861414.0, ans=0.1 2023-06-21 23:49:17,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=861414.0, ans=0.0 2023-06-21 23:49:24,736 INFO [train.py:996] (0/4) Epoch 5, batch 21600, loss[loss=0.2039, simple_loss=0.2966, pruned_loss=0.05558, over 21649.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2883, pruned_loss=0.07537, over 4251037.75 frames. 
], batch size: 298, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:50:01,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=861534.0, ans=0.125 2023-06-21 23:50:26,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=861534.0, ans=0.1 2023-06-21 23:50:32,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=861594.0, ans=0.2 2023-06-21 23:50:37,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=861594.0, ans=0.125 2023-06-21 23:50:39,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=861594.0, ans=0.0 2023-06-21 23:51:31,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=861714.0, ans=0.1 2023-06-21 23:51:36,730 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.91 vs. limit=22.5 2023-06-21 23:51:38,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-21 23:51:39,802 INFO [train.py:996] (0/4) Epoch 5, batch 21650, loss[loss=0.1864, simple_loss=0.2782, pruned_loss=0.04732, over 21638.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2911, pruned_loss=0.07305, over 4254810.65 frames. ], batch size: 263, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:51:41,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=861774.0, ans=0.0 2023-06-21 23:51:46,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=861774.0, ans=0.125 2023-06-21 23:52:19,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=861834.0, ans=0.1 2023-06-21 23:52:58,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=861954.0, ans=0.04949747468305833 2023-06-21 23:53:16,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=861954.0, ans=0.125 2023-06-21 23:53:25,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=861954.0, ans=0.0 2023-06-21 23:53:43,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.320e+02 2.612e+02 3.014e+02 5.606e+02, threshold=5.225e+02, percent-clipped=2.0 2023-06-21 23:53:46,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=862014.0, ans=0.0 2023-06-21 23:53:50,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.60 vs. limit=15.0 2023-06-21 23:53:52,883 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:53:55,259 INFO [train.py:996] (0/4) Epoch 5, batch 21700, loss[loss=0.2757, simple_loss=0.3092, pruned_loss=0.1211, over 21269.00 frames. 
], tot_loss[loss=0.2163, simple_loss=0.2903, pruned_loss=0.07115, over 4259649.69 frames. ], batch size: 507, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:54:00,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-21 23:54:09,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=862134.0, ans=0.2 2023-06-21 23:54:11,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2023-06-21 23:54:59,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=862194.0, ans=0.5 2023-06-21 23:55:56,729 INFO [train.py:996] (0/4) Epoch 5, batch 21750, loss[loss=0.1891, simple_loss=0.2491, pruned_loss=0.06454, over 21591.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2862, pruned_loss=0.07122, over 4259827.25 frames. ], batch size: 247, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:56:15,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=862374.0, ans=15.0 2023-06-21 23:57:01,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=862494.0, ans=0.125 2023-06-21 23:57:58,710 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.550e+02 2.899e+02 3.518e+02 5.551e+02, threshold=5.797e+02, percent-clipped=2.0 2023-06-21 23:58:11,064 INFO [train.py:996] (0/4) Epoch 5, batch 21800, loss[loss=0.2273, simple_loss=0.3079, pruned_loss=0.07333, over 21891.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.286, pruned_loss=0.07278, over 4258586.02 frames. ], batch size: 373, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 23:58:11,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=862674.0, ans=0.125 2023-06-21 23:58:11,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=862674.0, ans=0.125 2023-06-21 23:59:01,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.97 vs. limit=6.0 2023-06-21 23:59:15,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=862794.0, ans=0.0 2023-06-21 23:59:18,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=862794.0, ans=0.2 2023-06-21 23:59:23,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=22.5 2023-06-21 23:59:25,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=862854.0, ans=0.125 2023-06-22 00:00:25,047 INFO [train.py:996] (0/4) Epoch 5, batch 21850, loss[loss=0.2297, simple_loss=0.3257, pruned_loss=0.06686, over 21628.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2924, pruned_loss=0.07353, over 4266635.86 frames. 
], batch size: 263, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:00:42,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=862974.0, ans=0.1 2023-06-22 00:01:27,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=863094.0, ans=0.2 2023-06-22 00:01:50,734 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.22 vs. limit=15.0 2023-06-22 00:02:09,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-22 00:02:24,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0 2023-06-22 00:02:28,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=863214.0, ans=0.0 2023-06-22 00:02:32,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.444e+02 2.856e+02 3.585e+02 5.005e+02, threshold=5.712e+02, percent-clipped=0.0 2023-06-22 00:02:43,747 INFO [train.py:996] (0/4) Epoch 5, batch 21900, loss[loss=0.2196, simple_loss=0.2956, pruned_loss=0.07177, over 21244.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2914, pruned_loss=0.07415, over 4278942.03 frames. ], batch size: 159, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:04:13,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-22 00:04:56,139 INFO [train.py:996] (0/4) Epoch 5, batch 21950, loss[loss=0.2208, simple_loss=0.2661, pruned_loss=0.08774, over 20164.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2872, pruned_loss=0.07398, over 4281652.89 frames. ], batch size: 703, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:05:06,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-22 00:05:10,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=863574.0, ans=0.125 2023-06-22 00:05:25,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=863634.0, ans=0.1 2023-06-22 00:05:49,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-22 00:05:53,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.28 vs. limit=8.0 2023-06-22 00:05:55,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=863694.0, ans=0.125 2023-06-22 00:06:45,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.222e+02 2.484e+02 2.757e+02 5.064e+02, threshold=4.969e+02, percent-clipped=0.0 2023-06-22 00:06:46,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.28 vs. 
limit=15.0 2023-06-22 00:06:49,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=863814.0, ans=0.0 2023-06-22 00:06:55,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=863814.0, ans=0.125 2023-06-22 00:07:03,553 INFO [train.py:996] (0/4) Epoch 5, batch 22000, loss[loss=0.2658, simple_loss=0.3194, pruned_loss=0.1061, over 21322.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2815, pruned_loss=0.07135, over 4285490.09 frames. ], batch size: 471, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:07:06,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=863874.0, ans=0.0 2023-06-22 00:07:35,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=863934.0, ans=0.125 2023-06-22 00:07:37,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-06-22 00:07:52,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-22 00:08:01,709 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-144000.pt 2023-06-22 00:08:46,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=864054.0, ans=10.0 2023-06-22 00:09:13,121 INFO [train.py:996] (0/4) Epoch 5, batch 22050, loss[loss=0.1724, simple_loss=0.2517, pruned_loss=0.04657, over 16157.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.286, pruned_loss=0.07256, over 4268554.61 frames. ], batch size: 60, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:09:13,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=864174.0, ans=0.125 2023-06-22 00:09:55,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=864234.0, ans=0.125 2023-06-22 00:11:10,350 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.782e+02 3.105e+02 3.760e+02 6.556e+02, threshold=6.210e+02, percent-clipped=5.0 2023-06-22 00:11:27,968 INFO [train.py:996] (0/4) Epoch 5, batch 22100, loss[loss=0.2311, simple_loss=0.3047, pruned_loss=0.07876, over 21401.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2971, pruned_loss=0.07763, over 4277157.64 frames. ], batch size: 176, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:11:53,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=864534.0, ans=0.125 2023-06-22 00:12:22,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=864534.0, ans=0.2 2023-06-22 00:13:49,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=864714.0, ans=0.125 2023-06-22 00:13:55,551 INFO [train.py:996] (0/4) Epoch 5, batch 22150, loss[loss=0.2125, simple_loss=0.2853, pruned_loss=0.06984, over 21845.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3005, pruned_loss=0.07933, over 4284766.35 frames. 
], batch size: 282, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:15:54,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 2.802e+02 3.197e+02 3.821e+02 5.658e+02, threshold=6.394e+02, percent-clipped=0.0 2023-06-22 00:16:08,468 INFO [train.py:996] (0/4) Epoch 5, batch 22200, loss[loss=0.243, simple_loss=0.3257, pruned_loss=0.08016, over 21258.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3025, pruned_loss=0.08021, over 4288081.63 frames. ], batch size: 159, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:17:28,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=865194.0, ans=0.07 2023-06-22 00:18:35,914 INFO [train.py:996] (0/4) Epoch 5, batch 22250, loss[loss=0.2347, simple_loss=0.3381, pruned_loss=0.0656, over 20866.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3083, pruned_loss=0.08161, over 4289862.83 frames. ], batch size: 608, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:19:09,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-22 00:19:15,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=865434.0, ans=0.035 2023-06-22 00:20:18,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=865554.0, ans=0.125 2023-06-22 00:20:28,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=865614.0, ans=0.125 2023-06-22 00:20:29,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=865614.0, ans=0.1 2023-06-22 00:20:30,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.644e+02 2.877e+02 3.306e+02 4.671e+02, threshold=5.753e+02, percent-clipped=0.0 2023-06-22 00:20:37,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=865614.0, ans=0.125 2023-06-22 00:20:42,609 INFO [train.py:996] (0/4) Epoch 5, batch 22300, loss[loss=0.2378, simple_loss=0.3026, pruned_loss=0.08651, over 21860.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3103, pruned_loss=0.08346, over 4290310.96 frames. ], batch size: 371, lr: 6.08e-03, grad_scale: 32.0 2023-06-22 00:21:12,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=865674.0, ans=0.125 2023-06-22 00:21:13,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=865674.0, ans=0.2 2023-06-22 00:21:19,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.46 vs. limit=10.0 2023-06-22 00:22:13,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. 
2023-06-22 00:22:13,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0
2023-06-22 00:22:25,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=865794.0, ans=0.125
2023-06-22 00:22:51,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=865914.0, ans=0.0
2023-06-22 00:23:07,783 INFO [train.py:996] (0/4) Epoch 5, batch 22350, loss[loss=0.22, simple_loss=0.2902, pruned_loss=0.07486, over 21654.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3087, pruned_loss=0.08429, over 4297517.54 frames. ], batch size: 263, lr: 6.07e-03, grad_scale: 32.0
2023-06-22 00:23:20,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=865974.0, ans=0.1
2023-06-22 00:24:23,924 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0
2023-06-22 00:24:33,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=866154.0, ans=0.04949747468305833
2023-06-22 00:24:47,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=866154.0, ans=0.02
2023-06-22 00:25:08,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.517e+02 2.753e+02 3.146e+02 5.107e+02, threshold=5.506e+02, percent-clipped=0.0
2023-06-22 00:25:40,369 INFO [train.py:996] (0/4) Epoch 5, batch 22400, loss[loss=0.2065, simple_loss=0.3143, pruned_loss=0.04937, over 20846.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3044, pruned_loss=0.0809, over 4286943.75 frames. ], batch size: 607, lr: 6.07e-03, grad_scale: 32.0
2023-06-22 00:25:40,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=866274.0, ans=0.04949747468305833
2023-06-22 00:25:52,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=866274.0, ans=0.125
2023-06-22 00:26:50,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=866454.0, ans=0.0
2023-06-22 00:27:44,992 INFO [train.py:996] (0/4) Epoch 5, batch 22450, loss[loss=0.1981, simple_loss=0.2567, pruned_loss=0.06972, over 21694.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.298, pruned_loss=0.07908, over 4276268.13 frames. ], batch size: 283, lr: 6.07e-03, grad_scale: 32.0
2023-06-22 00:28:30,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=866634.0, ans=0.0
2023-06-22 00:29:17,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=866754.0, ans=0.125
2023-06-22 00:29:33,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.555e+02 2.922e+02 4.023e+02 6.050e+02, threshold=5.844e+02, percent-clipped=3.0
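The optim.py records report the quartiles (min, 25%, median, 75%, max) of recent gradient norms together with the clipping threshold and the share of batches clipped. A rough sketch, assuming the threshold is clipping_scale times the running median of recent global gradient norms; the actual logic in icefall's optim.py differs in detail:

    import torch

    # Hedged sketch: clip to clipping_scale * median over a window of recent
    # gradient norms, and report the quartiles that the log prints.
    def clip_by_recent_median(params, history, clipping_scale=2.0, window=500):
        params = [p for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
        history.append(norm.item())
        del history[:-window]  # keep only the most recent norms
        q = torch.quantile(torch.tensor(history),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2].item()  # 2.0 * median
        clipped = norm.item() > threshold
        if clipped:
            for p in params:
                p.grad.mul_(threshold / norm)
        return q, threshold, clipped

A quartile-based threshold adapts to the natural scale of the model's gradients instead of requiring a hand-tuned absolute clip value, which is presumably why the threshold in the log drifts over time.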
2023-06-22 00:29:55,179 INFO [train.py:996] (0/4) Epoch 5, batch 22500, loss[loss=0.235, simple_loss=0.3206, pruned_loss=0.07471, over 21262.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2938, pruned_loss=0.07899, over 4273383.89 frames. ], batch size: 176, lr: 6.07e-03, grad_scale: 32.0
2023-06-22 00:30:20,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.67 vs. limit=15.0
2023-06-22 00:31:43,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=867054.0, ans=0.125
2023-06-22 00:31:45,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=867054.0, ans=0.0
2023-06-22 00:32:10,855 INFO [train.py:996] (0/4) Epoch 5, batch 22550, loss[loss=0.23, simple_loss=0.3247, pruned_loss=0.06765, over 21615.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2985, pruned_loss=0.07939, over 4277397.22 frames. ], batch size: 263, lr: 6.07e-03, grad_scale: 32.0
2023-06-22 00:32:36,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=867174.0, ans=0.1
2023-06-22 00:32:43,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=867174.0, ans=0.2
2023-06-22 00:33:15,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=867294.0, ans=0.125
2023-06-22 00:33:33,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=867354.0, ans=10.0
2023-06-22 00:33:43,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=22.5
2023-06-22 00:34:15,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.802e+02 3.271e+02 3.660e+02 5.759e+02, threshold=6.543e+02, percent-clipped=0.0
2023-06-22 00:34:25,678 INFO [train.py:996] (0/4) Epoch 5, batch 22600, loss[loss=0.149, simple_loss=0.1944, pruned_loss=0.05176, over 16642.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3021, pruned_loss=0.08023, over 4279132.36 frames. ], batch size: 63, lr: 6.07e-03, grad_scale: 32.0
2023-06-22 00:34:59,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=867474.0, ans=0.0
2023-06-22 00:35:37,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=867594.0, ans=0.2
2023-06-22 00:36:17,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=867654.0, ans=0.2
2023-06-22 00:36:24,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=867714.0, ans=0.0
2023-06-22 00:36:30,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0
2023-06-22 00:36:38,638 INFO [train.py:996] (0/4) Epoch 5, batch 22650, loss[loss=0.1954, simple_loss=0.2554, pruned_loss=0.06767, over 21578.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3004, pruned_loss=0.07935, over 4275100.36 frames. ], batch size: 231, lr: 6.07e-03, grad_scale: 32.0
2023-06-22 00:37:06,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=867774.0, ans=0.0
2023-06-22 00:37:09,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=867774.0, ans=0.1
2023-06-22 00:37:49,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=867894.0, ans=0.07
2023-06-22 00:38:33,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.499e+02 2.940e+02 3.814e+02 6.401e+02, threshold=5.879e+02, percent-clipped=0.0
2023-06-22 00:38:58,680 INFO [train.py:996] (0/4) Epoch 5, batch 22700, loss[loss=0.1887, simple_loss=0.2497, pruned_loss=0.06385, over 21603.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2933, pruned_loss=0.07839, over 4266897.10 frames. ], batch size: 231, lr: 6.07e-03, grad_scale: 32.0
2023-06-22 00:38:59,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=868074.0, ans=15.0
2023-06-22 00:39:51,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=868194.0, ans=0.125
2023-06-22 00:40:59,327 INFO [train.py:996] (0/4) Epoch 5, batch 22750, loss[loss=0.2624, simple_loss=0.3306, pruned_loss=0.0971, over 21454.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2935, pruned_loss=0.08002, over 4272607.46 frames. ], batch size: 194, lr: 6.07e-03, grad_scale: 32.0
2023-06-22 00:41:02,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=868374.0, ans=0.125
2023-06-22 00:41:13,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=868374.0, ans=0.125
2023-06-22 00:41:16,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=868374.0, ans=0.2
2023-06-22 00:42:58,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.771e+02 3.219e+02 3.779e+02 6.245e+02, threshold=6.438e+02, percent-clipped=2.0
2023-06-22 00:43:15,159 INFO [train.py:996] (0/4) Epoch 5, batch 22800, loss[loss=0.2446, simple_loss=0.3165, pruned_loss=0.08637, over 21884.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2977, pruned_loss=0.08211, over 4277687.95 frames. ], batch size: 107, lr: 6.06e-03, grad_scale: 32.0
2023-06-22 00:43:20,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=868674.0, ans=22.5
2023-06-22 00:44:20,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=868794.0, ans=0.125
2023-06-22 00:44:24,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=868854.0, ans=0.2
2023-06-22 00:44:47,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=868914.0, ans=0.0
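Each train.py:996 record pairs the loss on the current batch (with its frame count) against tot_loss, a running average weighted by the number of frames seen, which is why large batches move tot_loss more than small ones. A small sketch of such frame-weighted averaging; the exact decay used by train.py is an assumption here:

    # Hedged sketch: exponentially decayed, frame-weighted running average.
    class FrameWeightedAverage:
        def __init__(self, decay=0.999):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss, batch_frames):
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self):
            return self.loss_sum / max(self.frames, 1.0)

    avg = FrameWeightedAverage()
    avg.update(0.2125, 21845.0)  # numbers in the style of the records above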
2023-06-22 00:45:16,015 INFO [train.py:996] (0/4) Epoch 5, batch 22850, loss[loss=0.2103, simple_loss=0.2704, pruned_loss=0.07514, over 21189.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2934, pruned_loss=0.08068, over 4278781.51 frames. ], batch size: 159, lr: 6.06e-03, grad_scale: 32.0
2023-06-22 00:45:44,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=868974.0, ans=0.125
2023-06-22 00:47:09,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=869214.0, ans=0.2
2023-06-22 00:47:36,104 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.707e+02 3.130e+02 3.568e+02 4.939e+02, threshold=6.260e+02, percent-clipped=0.0
2023-06-22 00:47:59,191 INFO [train.py:996] (0/4) Epoch 5, batch 22900, loss[loss=0.2766, simple_loss=0.3771, pruned_loss=0.08803, over 21513.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2969, pruned_loss=0.08054, over 4268242.90 frames. ], batch size: 471, lr: 6.06e-03, grad_scale: 16.0
2023-06-22 00:49:01,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=869394.0, ans=0.2
2023-06-22 00:50:33,842 INFO [train.py:996] (0/4) Epoch 5, batch 22950, loss[loss=0.2439, simple_loss=0.3466, pruned_loss=0.07059, over 21484.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3102, pruned_loss=0.07883, over 4267388.70 frames. ], batch size: 195, lr: 6.06e-03, grad_scale: 16.0
2023-06-22 00:50:35,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=869574.0, ans=0.1
2023-06-22 00:50:50,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=869634.0, ans=0.125
2023-06-22 00:50:53,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=869634.0, ans=0.125
2023-06-22 00:51:01,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=869634.0, ans=0.1
2023-06-22 00:52:23,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.567e+02 2.952e+02 3.530e+02 5.726e+02, threshold=5.904e+02, percent-clipped=0.0
2023-06-22 00:52:44,565 INFO [train.py:996] (0/4) Epoch 5, batch 23000, loss[loss=0.2388, simple_loss=0.3055, pruned_loss=0.08601, over 21231.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3092, pruned_loss=0.07676, over 4278327.70 frames. ], batch size: 143, lr: 6.06e-03, grad_scale: 16.0
2023-06-22 00:52:51,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0
2023-06-22 00:53:00,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=869934.0, ans=0.125
2023-06-22 00:53:51,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=15.0
2023-06-22 00:54:01,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0
2023-06-22 00:55:02,670 INFO [train.py:996] (0/4) Epoch 5, batch 23050, loss[loss=0.2762, simple_loss=0.3387, pruned_loss=0.1069, over 21820.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3114, pruned_loss=0.0796, over 4272066.50 frames. ], batch size: 441, lr: 6.06e-03, grad_scale: 16.0
2023-06-22 00:55:03,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=870174.0, ans=0.1
2023-06-22 00:57:08,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=870414.0, ans=0.125
2023-06-22 00:57:09,227 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 2.630e+02 3.010e+02 3.445e+02 5.620e+02, threshold=6.019e+02, percent-clipped=0.0
2023-06-22 00:57:12,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=870414.0, ans=0.09899494936611666
2023-06-22 00:57:18,360 INFO [train.py:996] (0/4) Epoch 5, batch 23100, loss[loss=0.193, simple_loss=0.2431, pruned_loss=0.07143, over 20789.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3062, pruned_loss=0.08017, over 4273300.66 frames. ], batch size: 609, lr: 6.06e-03, grad_scale: 16.0
2023-06-22 00:59:29,230 INFO [train.py:996] (0/4) Epoch 5, batch 23150, loss[loss=0.2023, simple_loss=0.2708, pruned_loss=0.06691, over 21808.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.301, pruned_loss=0.07962, over 4279271.78 frames. ], batch size: 298, lr: 6.06e-03, grad_scale: 16.0
2023-06-22 00:59:33,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=870774.0, ans=0.0
2023-06-22 01:00:11,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=870834.0, ans=0.1
2023-06-22 01:00:50,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=870954.0, ans=0.0
2023-06-22 01:01:27,613 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 01:01:30,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.556e+02 2.836e+02 3.479e+02 5.811e+02, threshold=5.672e+02, percent-clipped=0.0
2023-06-22 01:01:35,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=871014.0, ans=0.2
2023-06-22 01:01:38,837 INFO [train.py:996] (0/4) Epoch 5, batch 23200, loss[loss=0.2226, simple_loss=0.288, pruned_loss=0.07863, over 21585.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3002, pruned_loss=0.08034, over 4289770.01 frames. ], batch size: 212, lr: 6.06e-03, grad_scale: 32.0
2023-06-22 01:02:19,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=871134.0, ans=0.2
2023-06-22 01:02:21,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=871134.0, ans=15.0
2023-06-22 01:02:52,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=871194.0, ans=10.0
2023-06-22 01:03:23,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=871254.0, ans=0.125
2023-06-22 01:03:42,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=871314.0, ans=0.125
2023-06-22 01:03:43,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=871314.0, ans=0.125
2023-06-22 01:03:52,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0
2023-06-22 01:03:54,137 INFO [train.py:996] (0/4) Epoch 5, batch 23250, loss[loss=0.2344, simple_loss=0.2987, pruned_loss=0.08507, over 21678.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3001, pruned_loss=0.08162, over 4297008.42 frames. ], batch size: 230, lr: 6.06e-03, grad_scale: 32.0
2023-06-22 01:04:11,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=871434.0, ans=0.0
2023-06-22 01:04:49,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.66 vs. limit=12.0
2023-06-22 01:05:15,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=871554.0, ans=0.125
2023-06-22 01:05:53,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.691e+02 3.046e+02 3.824e+02 6.255e+02, threshold=6.093e+02, percent-clipped=3.0
2023-06-22 01:06:02,688 INFO [train.py:996] (0/4) Epoch 5, batch 23300, loss[loss=0.2514, simple_loss=0.3427, pruned_loss=0.08004, over 21743.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3084, pruned_loss=0.0832, over 4297104.96 frames. ], batch size: 351, lr: 6.05e-03, grad_scale: 16.0
2023-06-22 01:06:59,853 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0
2023-06-22 01:07:42,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=871854.0, ans=0.035
2023-06-22 01:08:19,715 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0
2023-06-22 01:08:37,164 INFO [train.py:996] (0/4) Epoch 5, batch 23350, loss[loss=0.1624, simple_loss=0.2395, pruned_loss=0.04263, over 21426.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.313, pruned_loss=0.08267, over 4278459.03 frames. ], batch size: 211, lr: 6.05e-03, grad_scale: 16.0
2023-06-22 01:09:08,345 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 01:09:28,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=872034.0, ans=0.125
2023-06-22 01:09:32,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.62 vs. limit=6.0
2023-06-22 01:10:07,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=872154.0, ans=0.0
2023-06-22 01:10:21,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0
2023-06-22 01:10:31,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.322e+02 2.505e+02 2.999e+02 5.635e+02, threshold=5.010e+02, percent-clipped=0.0
2023-06-22 01:10:47,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=872274.0, ans=0.035
2023-06-22 01:10:48,128 INFO [train.py:996] (0/4) Epoch 5, batch 23400, loss[loss=0.2235, simple_loss=0.2937, pruned_loss=0.07664, over 21475.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3065, pruned_loss=0.07899, over 4282111.91 frames. ], batch size: 548, lr: 6.05e-03, grad_scale: 16.0
2023-06-22 01:10:50,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.63 vs. limit=10.0
2023-06-22 01:11:34,194 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.77 vs. limit=15.0
2023-06-22 01:11:51,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=872394.0, ans=0.1
2023-06-22 01:13:11,269 INFO [train.py:996] (0/4) Epoch 5, batch 23450, loss[loss=0.3281, simple_loss=0.3621, pruned_loss=0.147, over 21483.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3062, pruned_loss=0.08091, over 4282528.82 frames. ], batch size: 509, lr: 6.05e-03, grad_scale: 16.0
2023-06-22 01:13:48,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5
2023-06-22 01:13:59,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=872634.0, ans=0.125
2023-06-22 01:15:10,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=872814.0, ans=0.1
2023-06-22 01:15:12,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=872814.0, ans=0.0
2023-06-22 01:15:13,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.562e+02 2.957e+02 3.488e+02 7.125e+02, threshold=5.914e+02, percent-clipped=6.0
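The Whitening records compare a per-module metric against a limit, where the metric is (roughly) how far the covariance of the activations, split into num_groups channel groups, is from a multiple of the identity; 1.0 would mean fully "white". A sketch under that assumption, not the exact scaling.py formula:

    import torch

    def whitening_metric(x, num_groups):
        # x: (..., num_channels). Returns E[lambda^2] / E[lambda]^2 of the
        # per-group covariance eigenvalues, which is 1.0 iff the covariance
        # is a multiple of the identity and grows the less "white" x is.
        x = x.reshape(-1, x.shape[-1])
        metrics = []
        for g in x.chunk(num_groups, dim=-1):
            g = g - g.mean(dim=0)
            cov = g.t() @ g / g.shape[0]
            d = cov.shape[0]
            mean_eig = torch.diagonal(cov).mean()   # trace(C)/d   = E[lambda]
            mean_eig_sq = (cov * cov).sum() / d     # trace(C@C)/d = E[lambda^2]
            metrics.append(mean_eig_sq / (mean_eig ** 2 + 1e-20))
        return torch.stack(metrics).mean()

When the logged metric exceeds the limit, the whitening module applies a corrective penalty to push the activations back toward a well-conditioned covariance, which is why these records appear only sporadically.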
2023-06-22 01:15:35,302 INFO [train.py:996] (0/4) Epoch 5, batch 23500, loss[loss=0.2169, simple_loss=0.293, pruned_loss=0.07038, over 21846.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.306, pruned_loss=0.08241, over 4282213.64 frames. ], batch size: 124, lr: 6.05e-03, grad_scale: 16.0
2023-06-22 01:15:40,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0
2023-06-22 01:16:17,311 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.89 vs. limit=6.0
2023-06-22 01:16:40,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=872994.0, ans=0.125
2023-06-22 01:16:48,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0
2023-06-22 01:17:32,720 INFO [train.py:996] (0/4) Epoch 5, batch 23550, loss[loss=0.2154, simple_loss=0.2813, pruned_loss=0.07474, over 21744.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3016, pruned_loss=0.08168, over 4275982.27 frames. ], batch size: 112, lr: 6.05e-03, grad_scale: 16.0
2023-06-22 01:17:35,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.87 vs. limit=22.5
2023-06-22 01:18:05,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=873174.0, ans=0.125
2023-06-22 01:18:27,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=873234.0, ans=0.125
2023-06-22 01:18:45,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=873294.0, ans=0.0
2023-06-22 01:19:00,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.89 vs. limit=6.0
2023-06-22 01:19:27,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.576e+02 2.870e+02 3.585e+02 5.605e+02, threshold=5.739e+02, percent-clipped=0.0
2023-06-22 01:19:32,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0
2023-06-22 01:19:45,023 INFO [train.py:996] (0/4) Epoch 5, batch 23600, loss[loss=0.2734, simple_loss=0.3474, pruned_loss=0.09969, over 21347.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3005, pruned_loss=0.08125, over 4271561.64 frames. ], batch size: 159, lr: 6.05e-03, grad_scale: 32.0
2023-06-22 01:19:45,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=873474.0, ans=0.2
2023-06-22 01:21:23,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.78 vs. limit=10.0
2023-06-22 01:21:49,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=873714.0, ans=0.025
2023-06-22 01:22:05,162 INFO [train.py:996] (0/4) Epoch 5, batch 23650, loss[loss=0.2251, simple_loss=0.2892, pruned_loss=0.08051, over 20093.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3015, pruned_loss=0.07998, over 4269313.03 frames. ], batch size: 704, lr: 6.05e-03, grad_scale: 32.0
2023-06-22 01:23:04,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=873894.0, ans=0.1
2023-06-22 01:23:41,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=873954.0, ans=0.125
2023-06-22 01:23:43,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=873954.0, ans=0.0
2023-06-22 01:23:45,409 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.65 vs. limit=6.0
2023-06-22 01:23:46,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=873954.0, ans=0.0
2023-06-22 01:24:28,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.418e+02 2.647e+02 3.193e+02 5.088e+02, threshold=5.293e+02, percent-clipped=0.0
2023-06-22 01:24:41,717 INFO [train.py:996] (0/4) Epoch 5, batch 23700, loss[loss=0.256, simple_loss=0.3277, pruned_loss=0.09215, over 21736.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3044, pruned_loss=0.0795, over 4279261.25 frames. ], batch size: 441, lr: 6.05e-03, grad_scale: 32.0
2023-06-22 01:24:54,893 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.60 vs. limit=10.0
2023-06-22 01:25:09,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=874134.0, ans=0.025
2023-06-22 01:25:19,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=874194.0, ans=0.125
2023-06-22 01:25:43,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=874194.0, ans=0.05
2023-06-22 01:25:54,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0
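The balancer fields being scheduled in these records (prob, min_positive, max_abs, min_abs and so on) bound statistics of a module's activations, such as the fraction of positive values or their typical magnitude. A toy illustration of turning such bounds into a differentiable penalty, using a sigmoid as a soft "fraction positive"; the real Balancer in scaling.py instead modifies gradients directly, and this helper name is made up:

    import torch
    import torch.nn.functional as F

    def balance_penalty(x, min_positive=0.05, max_positive=0.95, max_abs=10.0):
        # Soft, differentiable estimate of the fraction of positive values.
        frac_pos = torch.sigmoid(x / 0.1).mean()
        pen = F.relu(min_positive - frac_pos) + F.relu(frac_pos - max_positive)
        # Penalize RMS magnitude above max_abs.
        rms = x.pow(2).mean().sqrt()
        return pen + F.relu(rms - max_abs)

The "prob" values in the log suggest the constraint is applied stochastically to a fraction of batches rather than on every step.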
2023-06-22 01:26:56,478 INFO [train.py:996] (0/4) Epoch 5, batch 23750, loss[loss=0.2246, simple_loss=0.2969, pruned_loss=0.07616, over 21158.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3077, pruned_loss=0.08041, over 4286716.97 frames. ], batch size: 143, lr: 6.05e-03, grad_scale: 32.0
2023-06-22 01:27:01,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=874374.0, ans=0.125
2023-06-22 01:27:03,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=874374.0, ans=0.0
2023-06-22 01:27:12,155 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 01:27:53,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=874494.0, ans=10.0
2023-06-22 01:28:22,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=874554.0, ans=0.2
2023-06-22 01:29:02,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.409e+02 2.705e+02 3.135e+02 4.675e+02, threshold=5.410e+02, percent-clipped=0.0
2023-06-22 01:29:06,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=874614.0, ans=0.0
2023-06-22 01:29:10,023 INFO [train.py:996] (0/4) Epoch 5, batch 23800, loss[loss=0.2074, simple_loss=0.2787, pruned_loss=0.06808, over 21775.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3049, pruned_loss=0.07779, over 4279901.13 frames. ], batch size: 124, lr: 6.04e-03, grad_scale: 32.0
2023-06-22 01:29:41,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=874734.0, ans=0.125
2023-06-22 01:29:55,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=874734.0, ans=0.125
2023-06-22 01:30:36,002 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 01:31:17,029 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5
2023-06-22 01:31:28,759 INFO [train.py:996] (0/4) Epoch 5, batch 23850, loss[loss=0.2959, simple_loss=0.4026, pruned_loss=0.09461, over 19760.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3154, pruned_loss=0.08028, over 4278313.90 frames. ], batch size: 702, lr: 6.04e-03, grad_scale: 32.0
2023-06-22 01:32:26,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.53 vs. limit=10.0
2023-06-22 01:32:45,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=875094.0, ans=0.125
2023-06-22 01:33:15,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=875154.0, ans=0.2
2023-06-22 01:33:15,140 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 01:33:39,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.789e+02 3.123e+02 3.786e+02 8.749e+02, threshold=6.247e+02, percent-clipped=6.0
2023-06-22 01:34:00,413 INFO [train.py:996] (0/4) Epoch 5, batch 23900, loss[loss=0.2341, simple_loss=0.3151, pruned_loss=0.07656, over 21730.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3233, pruned_loss=0.08307, over 4282483.85 frames. ], batch size: 351, lr: 6.04e-03, grad_scale: 32.0
2023-06-22 01:34:35,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=15.0
2023-06-22 01:34:46,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5
2023-06-22 01:35:24,880 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5
2023-06-22 01:35:27,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=875514.0, ans=0.125
2023-06-22 01:35:54,010 INFO [train.py:996] (0/4) Epoch 5, batch 23950, loss[loss=0.237, simple_loss=0.3082, pruned_loss=0.08288, over 21712.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.316, pruned_loss=0.08212, over 4272227.57 frames. ], batch size: 351, lr: 6.04e-03, grad_scale: 32.0
2023-06-22 01:36:17,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=875574.0, ans=0.125
2023-06-22 01:36:36,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=875634.0, ans=0.1
2023-06-22 01:36:38,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=875634.0, ans=0.1
2023-06-22 01:37:27,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=875754.0, ans=0.125
2023-06-22 01:37:39,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.13 vs. limit=5.0
2023-06-22 01:37:55,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.517e+02 2.934e+02 3.377e+02 5.286e+02, threshold=5.869e+02, percent-clipped=0.0
2023-06-22 01:38:06,339 INFO [train.py:996] (0/4) Epoch 5, batch 24000, loss[loss=0.2394, simple_loss=0.3176, pruned_loss=0.08059, over 21526.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3181, pruned_loss=0.08602, over 4280681.59 frames. ], batch size: 230, lr: 6.04e-03, grad_scale: 32.0
2023-06-22 01:38:06,340 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-22 01:38:46,516 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2672, simple_loss=0.3621, pruned_loss=0.08617, over 1796401.00 frames.
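At batch 24000 the trainer pauses to compute a validation loss over the whole dev set before resuming, and the validation numbers it logs are frame-weighted averages over 1796401 frames rather than a single batch. A bare-bones sketch of such a loop; the loader, field names and model interface here are placeholders, not icefall's actual signatures:

    import torch

    def compute_validation_loss(model, dev_loader, device):
        model.eval()
        loss_sum, frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                feats = batch["features"].to(device)  # placeholder field name
                loss, num_frames = model(feats)       # placeholder interface
                loss_sum += loss.item() * num_frames
                frames += num_frames
        model.train()
        return loss_sum / max(frames, 1.0)

Because the dev set is fixed, this number is comparable across epochs in a way the per-batch training losses are not.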
2023-06-22 01:38:46,518 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB
2023-06-22 01:38:55,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=875874.0, ans=0.04949747468305833
2023-06-22 01:39:51,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=875994.0, ans=0.0
2023-06-22 01:40:01,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=876054.0, ans=0.0
2023-06-22 01:40:04,873 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 01:40:46,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=876114.0, ans=0.1
2023-06-22 01:41:23,105 INFO [train.py:996] (0/4) Epoch 5, batch 24050, loss[loss=0.2281, simple_loss=0.3169, pruned_loss=0.06967, over 21281.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3199, pruned_loss=0.08631, over 4274031.51 frames. ], batch size: 548, lr: 6.04e-03, grad_scale: 32.0
2023-06-22 01:41:25,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.13 vs. limit=10.0
2023-06-22 01:41:55,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=876234.0, ans=0.125
2023-06-22 01:42:03,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=22.5
2023-06-22 01:42:04,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=876234.0, ans=0.2
2023-06-22 01:42:10,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=876294.0, ans=0.1
2023-06-22 01:43:24,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5
2023-06-22 01:43:24,681 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.535e+02 2.835e+02 3.239e+02 5.691e+02, threshold=5.670e+02, percent-clipped=0.0
2023-06-22 01:43:36,732 INFO [train.py:996] (0/4) Epoch 5, batch 24100, loss[loss=0.2529, simple_loss=0.3235, pruned_loss=0.09116, over 21407.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3202, pruned_loss=0.08471, over 4273255.57 frames. ], batch size: 194, lr: 6.04e-03, grad_scale: 32.0
2023-06-22 01:43:36,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=876474.0, ans=0.125
2023-06-22 01:43:47,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=876474.0, ans=0.2
2023-06-22 01:44:21,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.01 vs. limit=22.5
2023-06-22 01:44:55,726 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0
2023-06-22 01:45:49,886 INFO [train.py:996] (0/4) Epoch 5, batch 24150, loss[loss=0.3094, simple_loss=0.3497, pruned_loss=0.1345, over 21709.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3194, pruned_loss=0.08634, over 4282738.83 frames. ], batch size: 507, lr: 6.04e-03, grad_scale: 16.0
2023-06-22 01:48:02,121 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.649e+02 3.023e+02 3.593e+02 4.486e+02, threshold=6.046e+02, percent-clipped=0.0
2023-06-22 01:48:04,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.23 vs. limit=10.0
2023-06-22 01:48:06,736 INFO [train.py:996] (0/4) Epoch 5, batch 24200, loss[loss=0.2556, simple_loss=0.3407, pruned_loss=0.0852, over 21615.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3211, pruned_loss=0.08751, over 4285269.37 frames. ], batch size: 389, lr: 6.04e-03, grad_scale: 16.0
2023-06-22 01:48:24,437 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0
2023-06-22 01:48:43,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0
2023-06-22 01:50:05,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=877314.0, ans=0.0
2023-06-22 01:50:13,916 INFO [train.py:996] (0/4) Epoch 5, batch 24250, loss[loss=0.1969, simple_loss=0.2986, pruned_loss=0.04763, over 21698.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3181, pruned_loss=0.08143, over 4278534.90 frames. ], batch size: 247, lr: 6.03e-03, grad_scale: 16.0
2023-06-22 01:50:15,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=877374.0, ans=0.0
2023-06-22 01:52:32,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.979e+02 2.289e+02 2.942e+02 4.339e+02, threshold=4.579e+02, percent-clipped=0.0
2023-06-22 01:52:32,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=877614.0, ans=0.125
2023-06-22 01:52:36,816 INFO [train.py:996] (0/4) Epoch 5, batch 24300, loss[loss=0.1914, simple_loss=0.277, pruned_loss=0.05287, over 21572.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3095, pruned_loss=0.0754, over 4270224.92 frames. ], batch size: 441, lr: 6.03e-03, grad_scale: 16.0
2023-06-22 01:52:40,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=877674.0, ans=0.125
2023-06-22 01:53:07,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.26 vs. limit=22.5
2023-06-22 01:54:00,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=877794.0, ans=0.0
2023-06-22 01:54:43,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=877914.0, ans=0.0
2023-06-22 01:54:56,430 INFO [train.py:996] (0/4) Epoch 5, batch 24350, loss[loss=0.2434, simple_loss=0.3096, pruned_loss=0.0886, over 21862.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3066, pruned_loss=0.07594, over 4277964.92 frames. ], batch size: 371, lr: 6.03e-03, grad_scale: 16.0
2023-06-22 01:55:06,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=877974.0, ans=0.0
2023-06-22 01:56:38,581 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0
2023-06-22 01:57:07,941 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.557e+02 2.877e+02 3.455e+02 4.976e+02, threshold=5.754e+02, percent-clipped=3.0
2023-06-22 01:57:18,435 INFO [train.py:996] (0/4) Epoch 5, batch 24400, loss[loss=0.2671, simple_loss=0.3796, pruned_loss=0.07733, over 19666.00 frames. ], tot_loss[loss=0.236, simple_loss=0.312, pruned_loss=0.08001, over 4273171.72 frames. ], batch size: 702, lr: 6.03e-03, grad_scale: 32.0
2023-06-22 01:58:14,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=878334.0, ans=0.2
2023-06-22 01:58:34,391 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5
2023-06-22 01:58:49,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=878394.0, ans=0.125
2023-06-22 01:59:18,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=878514.0, ans=0.2
2023-06-22 01:59:38,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=878514.0, ans=0.0
2023-06-22 01:59:40,609 INFO [train.py:996] (0/4) Epoch 5, batch 24450, loss[loss=0.2111, simple_loss=0.2881, pruned_loss=0.06701, over 21398.00 frames. ], tot_loss[loss=0.238, simple_loss=0.314, pruned_loss=0.08097, over 4269944.45 frames. ], batch size: 194, lr: 6.03e-03, grad_scale: 32.0
2023-06-22 02:00:00,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=878574.0, ans=0.2
2023-06-22 02:00:13,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5
2023-06-22 02:00:28,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=878634.0, ans=0.04949747468305833
2023-06-22 02:00:58,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=878694.0, ans=0.95
2023-06-22 02:01:18,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5
2023-06-22 02:01:20,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=878814.0, ans=0.2
2023-06-22 02:01:40,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=878814.0, ans=0.2
2023-06-22 02:01:46,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.690e+02 3.074e+02 4.098e+02 6.316e+02, threshold=6.149e+02, percent-clipped=3.0
2023-06-22 02:01:47,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=878814.0, ans=0.035
2023-06-22 02:02:00,002 INFO [train.py:996] (0/4) Epoch 5, batch 24500, loss[loss=0.2716, simple_loss=0.3295, pruned_loss=0.1068, over 21609.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3133, pruned_loss=0.08045, over 4271400.69 frames. ], batch size: 471, lr: 6.03e-03, grad_scale: 32.0
2023-06-22 02:02:01,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=878874.0, ans=0.0
2023-06-22 02:02:34,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=878934.0, ans=0.125
2023-06-22 02:03:10,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=878994.0, ans=0.05
2023-06-22 02:03:29,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=879054.0, ans=0.0
2023-06-22 02:03:53,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=879114.0, ans=0.0
2023-06-22 02:03:59,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=879114.0, ans=0.125
2023-06-22 02:04:19,108 INFO [train.py:996] (0/4) Epoch 5, batch 24550, loss[loss=0.2704, simple_loss=0.3423, pruned_loss=0.09924, over 21601.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3144, pruned_loss=0.08226, over 4271259.67 frames. ], batch size: 389, lr: 6.03e-03, grad_scale: 32.0
2023-06-22 02:04:41,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0
2023-06-22 02:05:51,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=879354.0, ans=0.0
2023-06-22 02:05:54,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=879354.0, ans=0.125
2023-06-22 02:06:28,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.618e+02 2.906e+02 3.324e+02 4.617e+02, threshold=5.812e+02, percent-clipped=0.0
2023-06-22 02:06:28,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=879414.0, ans=0.125
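The grad_scale field in the train records (32.0 dropping to 16.0 and back) is the dynamic loss scale of fp16 mixed-precision training: it is halved when inf/nan gradients are detected and grows again after a run of clean steps. In stock PyTorch this behaviour is provided by torch.cuda.amp.GradScaler, and a training step wired the standard way looks like this (the model interface is a placeholder; zipformer's train.py manages its scaler similarly but not identically):

    import torch

    scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling for fp16

    def train_step(model, optimizer, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(batch)            # placeholder interface
        scaler.scale(loss).backward()      # backprop the scaled loss
        scaler.step(optimizer)             # unscales; skips step on inf/nan
        scaler.update()                    # shrink scale on overflow, else grow
        return loss.detach(), scaler.get_scale()

Scaling the loss keeps small gradient values representable in fp16, which is why the scale hovers around a power of two rather than staying fixed.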
2023-06-22 02:06:31,238 INFO [train.py:996] (0/4) Epoch 5, batch 24600, loss[loss=0.2709, simple_loss=0.3266, pruned_loss=0.1076, over 21446.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3104, pruned_loss=0.08294, over 4272481.22 frames. ], batch size: 473, lr: 6.03e-03, grad_scale: 16.0
2023-06-22 02:07:20,439 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0
2023-06-22 02:08:11,197 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0
2023-06-22 02:08:16,196 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0
2023-06-22 02:08:40,857 INFO [train.py:996] (0/4) Epoch 5, batch 24650, loss[loss=0.21, simple_loss=0.2708, pruned_loss=0.07458, over 21851.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.303, pruned_loss=0.08177, over 4267990.89 frames. ], batch size: 373, lr: 6.03e-03, grad_scale: 16.0
2023-06-22 02:09:03,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=22.5
2023-06-22 02:09:13,577 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5
2023-06-22 02:10:55,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.542e+02 2.905e+02 3.356e+02 5.377e+02, threshold=5.810e+02, percent-clipped=0.0
2023-06-22 02:10:58,331 INFO [train.py:996] (0/4) Epoch 5, batch 24700, loss[loss=0.2467, simple_loss=0.2976, pruned_loss=0.09792, over 21338.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3018, pruned_loss=0.08051, over 4258789.50 frames. ], batch size: 473, lr: 6.03e-03, grad_scale: 16.0
2023-06-22 02:11:35,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=880134.0, ans=0.125
2023-06-22 02:12:31,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=15.0
2023-06-22 02:12:38,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=880314.0, ans=0.09899494936611666
2023-06-22 02:12:51,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=880314.0, ans=0.2
2023-06-22 02:12:54,958 INFO [train.py:996] (0/4) Epoch 5, batch 24750, loss[loss=0.1935, simple_loss=0.2626, pruned_loss=0.06223, over 21854.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2956, pruned_loss=0.07793, over 4267772.78 frames. ], batch size: 107, lr: 6.02e-03, grad_scale: 16.0
2023-06-22 02:13:06,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=880374.0, ans=0.0
2023-06-22 02:13:39,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=880434.0, ans=0.0
2023-06-22 02:13:48,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=880434.0, ans=0.125
2023-06-22 02:15:04,034 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.468e+02 2.817e+02 3.299e+02 5.341e+02, threshold=5.634e+02, percent-clipped=0.0
2023-06-22 02:15:06,737 INFO [train.py:996] (0/4) Epoch 5, batch 24800, loss[loss=0.2197, simple_loss=0.2771, pruned_loss=0.0812, over 21489.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.29, pruned_loss=0.07708, over 4270457.43 frames. ], batch size: 195, lr: 6.02e-03, grad_scale: 32.0
2023-06-22 02:15:42,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=880734.0, ans=0.125
2023-06-22 02:16:13,090 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 02:16:21,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0
2023-06-22 02:17:22,834 INFO [train.py:996] (0/4) Epoch 5, batch 24850, loss[loss=0.1932, simple_loss=0.2529, pruned_loss=0.06678, over 21310.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2907, pruned_loss=0.07873, over 4280928.58 frames. ], batch size: 176, lr: 6.02e-03, grad_scale: 32.0
2023-06-22 02:17:26,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0
2023-06-22 02:18:31,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0
2023-06-22 02:18:37,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=881094.0, ans=0.1
2023-06-22 02:19:35,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.607e+02 3.027e+02 3.615e+02 6.896e+02, threshold=6.053e+02, percent-clipped=2.0
2023-06-22 02:19:38,481 INFO [train.py:996] (0/4) Epoch 5, batch 24900, loss[loss=0.2419, simple_loss=0.3411, pruned_loss=0.07138, over 20799.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2935, pruned_loss=0.07915, over 4287792.82 frames. ], batch size: 608, lr: 6.02e-03, grad_scale: 32.0
2023-06-22 02:20:02,493 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=22.5
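Checkpoints such as zipformer/exp_L_small/checkpoint-144000.pt earlier in this log are written every fixed number of training batches, so a run can be resumed or its recent checkpoints averaged later. A minimal sketch of periodic saving; the path and interval here mirror the log but the helper itself is illustrative:

    import torch
    from pathlib import Path

    def maybe_save_checkpoint(model, optimizer, batch_idx_train,
                              exp_dir=Path("zipformer/exp_L_small"),
                              save_every_n=4000):
        if batch_idx_train % save_every_n != 0:
            return
        exp_dir.mkdir(parents=True, exist_ok=True)
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            exp_dir / f"checkpoint-{batch_idx_train}.pt",
        )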
2023-06-22 02:20:20,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0
2023-06-22 02:20:24,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=881334.0, ans=0.125
2023-06-22 02:20:24,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=881334.0, ans=0.125
2023-06-22 02:21:05,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=881454.0, ans=0.125
2023-06-22 02:21:28,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=881454.0, ans=0.125
2023-06-22 02:22:16,531 INFO [train.py:996] (0/4) Epoch 5, batch 24950, loss[loss=0.2807, simple_loss=0.3448, pruned_loss=0.1083, over 21351.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3031, pruned_loss=0.08406, over 4286218.15 frames. ], batch size: 143, lr: 6.02e-03, grad_scale: 16.0
2023-06-22 02:23:26,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=881694.0, ans=0.1
2023-06-22 02:23:32,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0
2023-06-22 02:24:04,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.79 vs. limit=22.5
2023-06-22 02:24:34,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 2.886e+02 3.310e+02 3.841e+02 8.420e+02, threshold=6.620e+02, percent-clipped=6.0
2023-06-22 02:24:35,708 INFO [train.py:996] (0/4) Epoch 5, batch 25000, loss[loss=0.2377, simple_loss=0.3265, pruned_loss=0.07446, over 21947.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3099, pruned_loss=0.08582, over 4276871.50 frames. ], batch size: 317, lr: 6.02e-03, grad_scale: 16.0
2023-06-22 02:24:51,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.22 vs. limit=15.0
2023-06-22 02:25:32,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=881994.0, ans=0.0
2023-06-22 02:25:38,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=881994.0, ans=0.2
2023-06-22 02:26:06,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=882054.0, ans=0.125
2023-06-22 02:26:35,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=882114.0, ans=0.09899494936611666
2023-06-22 02:26:41,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=882114.0, ans=0.2
2023-06-22 02:26:44,208 INFO [train.py:996] (0/4) Epoch 5, batch 25050, loss[loss=0.2159, simple_loss=0.2755, pruned_loss=0.07812, over 21290.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3028, pruned_loss=0.08395, over 4281150.09 frames. ], batch size: 144, lr: 6.02e-03, grad_scale: 16.0
limit=12.0 2023-06-22 02:28:28,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=882414.0, ans=0.1 2023-06-22 02:28:57,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.524e+02 2.786e+02 3.350e+02 6.215e+02, threshold=5.572e+02, percent-clipped=0.0 2023-06-22 02:28:59,709 INFO [train.py:996] (0/4) Epoch 5, batch 25100, loss[loss=0.2358, simple_loss=0.3088, pruned_loss=0.08139, over 21192.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.296, pruned_loss=0.0817, over 4278485.05 frames. ], batch size: 159, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:29:08,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.15 vs. limit=15.0 2023-06-22 02:29:10,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=882474.0, ans=0.04949747468305833 2023-06-22 02:29:19,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=882534.0, ans=0.1 2023-06-22 02:29:31,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=882534.0, ans=0.0 2023-06-22 02:30:27,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=882654.0, ans=0.125 2023-06-22 02:30:49,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=882714.0, ans=0.1 2023-06-22 02:30:56,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=882714.0, ans=0.125 2023-06-22 02:31:07,289 INFO [train.py:996] (0/4) Epoch 5, batch 25150, loss[loss=0.2159, simple_loss=0.3034, pruned_loss=0.06415, over 21766.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3015, pruned_loss=0.08064, over 4274234.84 frames. ], batch size: 298, lr: 6.02e-03, grad_scale: 16.0 2023-06-22 02:31:22,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=882834.0, ans=0.125 2023-06-22 02:31:25,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=882834.0, ans=0.1 2023-06-22 02:31:33,129 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:31:33,253 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:32:09,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=22.5 2023-06-22 02:32:23,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=882954.0, ans=0.2 2023-06-22 02:33:00,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.215e+02 2.479e+02 2.821e+02 4.157e+02, threshold=4.958e+02, percent-clipped=0.0 2023-06-22 02:33:02,329 INFO [train.py:996] (0/4) Epoch 5, batch 25200, loss[loss=0.2208, simple_loss=0.2992, pruned_loss=0.0712, over 16292.00 frames. 
], tot_loss[loss=0.2278, simple_loss=0.3, pruned_loss=0.07782, over 4266588.51 frames. ], batch size: 62, lr: 6.02e-03, grad_scale: 32.0 2023-06-22 02:34:21,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=883254.0, ans=0.2 2023-06-22 02:35:17,797 INFO [train.py:996] (0/4) Epoch 5, batch 25250, loss[loss=0.2002, simple_loss=0.2624, pruned_loss=0.06897, over 21194.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2976, pruned_loss=0.07576, over 4268031.06 frames. ], batch size: 548, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:35:19,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=883374.0, ans=0.1 2023-06-22 02:35:47,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=10.0 2023-06-22 02:36:48,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=883554.0, ans=0.125 2023-06-22 02:37:20,503 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.367e+02 2.631e+02 2.996e+02 4.981e+02, threshold=5.263e+02, percent-clipped=1.0 2023-06-22 02:37:27,877 INFO [train.py:996] (0/4) Epoch 5, batch 25300, loss[loss=0.3105, simple_loss=0.3626, pruned_loss=0.1292, over 21373.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2957, pruned_loss=0.07542, over 4260226.41 frames. ], batch size: 508, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:37:36,034 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-22 02:37:50,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=883734.0, ans=0.1 2023-06-22 02:38:15,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=883794.0, ans=0.125 2023-06-22 02:39:02,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=883854.0, ans=0.2 2023-06-22 02:39:42,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=883914.0, ans=0.0 2023-06-22 02:39:44,740 INFO [train.py:996] (0/4) Epoch 5, batch 25350, loss[loss=0.1715, simple_loss=0.2496, pruned_loss=0.04666, over 21231.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2982, pruned_loss=0.07478, over 4255119.52 frames. ], batch size: 159, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:39:45,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=883974.0, ans=0.125 2023-06-22 02:40:01,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=884034.0, ans=0.5 2023-06-22 02:41:04,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=8.0 2023-06-22 02:41:59,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 2.361e+02 2.646e+02 3.091e+02 5.117e+02, threshold=5.293e+02, percent-clipped=0.0 2023-06-22 02:41:59,357 INFO [train.py:996] (0/4) Epoch 5, batch 25400, loss[loss=0.2079, simple_loss=0.2763, pruned_loss=0.06975, over 21643.00 frames. 
], tot_loss[loss=0.22, simple_loss=0.2927, pruned_loss=0.07363, over 4259162.37 frames. ], batch size: 247, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:42:05,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=884274.0, ans=0.125 2023-06-22 02:42:31,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=884334.0, ans=0.1 2023-06-22 02:44:14,162 INFO [train.py:996] (0/4) Epoch 5, batch 25450, loss[loss=0.2705, simple_loss=0.3594, pruned_loss=0.09083, over 21486.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2948, pruned_loss=0.07568, over 4263415.24 frames. ], batch size: 471, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:45:19,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=884694.0, ans=0.1 2023-06-22 02:46:22,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=884814.0, ans=0.125 2023-06-22 02:46:30,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.236e+02 2.528e+02 3.001e+02 4.958e+02, threshold=5.055e+02, percent-clipped=0.0 2023-06-22 02:46:30,138 INFO [train.py:996] (0/4) Epoch 5, batch 25500, loss[loss=0.2144, simple_loss=0.2762, pruned_loss=0.07625, over 15584.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2948, pruned_loss=0.07312, over 4244225.69 frames. ], batch size: 62, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:48:10,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=885054.0, ans=0.025 2023-06-22 02:48:33,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=885114.0, ans=0.125 2023-06-22 02:48:34,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=885114.0, ans=0.125 2023-06-22 02:48:42,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=885114.0, ans=0.0 2023-06-22 02:48:47,572 INFO [train.py:996] (0/4) Epoch 5, batch 25550, loss[loss=0.2447, simple_loss=0.3569, pruned_loss=0.06627, over 20724.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3008, pruned_loss=0.07336, over 4243954.49 frames. ], batch size: 607, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:48:48,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.55 vs. limit=15.0 2023-06-22 02:49:01,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.20 vs. limit=15.0 2023-06-22 02:49:17,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=15.0 2023-06-22 02:50:02,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=885294.0, ans=0.1 2023-06-22 02:50:27,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=885354.0, ans=0.1 2023-06-22 02:50:41,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=885354.0, ans=0.0 2023-06-22 02:51:01,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.415e+02 2.727e+02 3.146e+02 6.002e+02, threshold=5.455e+02, percent-clipped=2.0 2023-06-22 02:51:01,429 INFO [train.py:996] (0/4) Epoch 5, batch 25600, loss[loss=0.2463, simple_loss=0.3232, pruned_loss=0.08471, over 21443.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.305, pruned_loss=0.07422, over 4249178.24 frames. ], batch size: 548, lr: 6.01e-03, grad_scale: 32.0 2023-06-22 02:51:02,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=885474.0, ans=0.0 2023-06-22 02:53:11,096 INFO [train.py:996] (0/4) Epoch 5, batch 25650, loss[loss=0.2177, simple_loss=0.2859, pruned_loss=0.07472, over 21785.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3053, pruned_loss=0.07691, over 4243028.52 frames. ], batch size: 118, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:54:51,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=885954.0, ans=0.0 2023-06-22 02:54:53,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5 2023-06-22 02:55:24,451 INFO [train.py:996] (0/4) Epoch 5, batch 25700, loss[loss=0.2344, simple_loss=0.3136, pruned_loss=0.0776, over 21403.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3033, pruned_loss=0.07848, over 4256790.79 frames. ], batch size: 131, lr: 6.01e-03, grad_scale: 16.0 2023-06-22 02:55:40,582 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.593e+02 2.983e+02 3.503e+02 5.289e+02, threshold=5.966e+02, percent-clipped=0.0 2023-06-22 02:55:42,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=886074.0, ans=0.125 2023-06-22 02:55:50,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=886074.0, ans=0.125 2023-06-22 02:55:51,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=886074.0, ans=0.0 2023-06-22 02:56:13,173 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-22 02:56:16,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.26 vs. limit=6.0 2023-06-22 02:56:39,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=886194.0, ans=0.125 2023-06-22 02:57:11,259 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.41 vs. 
limit=10.0 2023-06-22 02:57:32,849 INFO [train.py:996] (0/4) Epoch 5, batch 25750, loss[loss=0.3273, simple_loss=0.4105, pruned_loss=0.122, over 21846.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3082, pruned_loss=0.0808, over 4264625.10 frames. ], batch size: 371, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 02:58:19,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.99 vs. limit=22.5 2023-06-22 03:00:20,176 INFO [train.py:996] (0/4) Epoch 5, batch 25800, loss[loss=0.2622, simple_loss=0.3332, pruned_loss=0.09557, over 21441.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3199, pruned_loss=0.08493, over 4272007.43 frames. ], batch size: 211, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:00:21,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.816e+02 3.413e+02 4.279e+02 8.490e+02, threshold=6.826e+02, percent-clipped=2.0 2023-06-22 03:00:53,362 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.50 vs. limit=5.0 2023-06-22 03:00:53,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=886734.0, ans=0.0 2023-06-22 03:00:55,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=886734.0, ans=0.125 2023-06-22 03:00:59,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=22.5 2023-06-22 03:01:25,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-22 03:02:41,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=886974.0, ans=0.1 2023-06-22 03:02:42,722 INFO [train.py:996] (0/4) Epoch 5, batch 25850, loss[loss=0.224, simple_loss=0.2926, pruned_loss=0.07772, over 21470.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3206, pruned_loss=0.08403, over 4277667.42 frames. ], batch size: 194, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:03:22,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-06-22 03:03:24,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=887094.0, ans=0.09899494936611666 2023-06-22 03:04:58,311 INFO [train.py:996] (0/4) Epoch 5, batch 25900, loss[loss=0.333, simple_loss=0.4156, pruned_loss=0.1252, over 21671.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.322, pruned_loss=0.08465, over 4280208.64 frames. 
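], batch size: 414, lr: 6.00e-03, grad_scale: 16.0

The loss[...] and tot_loss[...] fields in the train.py:996 records above report the combined pruned-transducer objective next to its two components. With the configured simple_loss_scale=0.5, the post-warmup combination is loss = 0.5 * simple_loss + pruned_loss, which reproduces the logged values (0.5 * 0.322 + 0.08465 = 0.24565, i.e. the 0.2457 just above). A minimal sketch of that combination, assuming an icefall-style warm_step ramp (the ramp constants below are an assumption, not read from this log):

def combine_transducer_losses(
    simple_loss, pruned_loss, batch_idx_train,
    warm_step=2000, simple_loss_scale=0.5,
):
    # After warm-up: down-weight the cheap "simple" loss, keep the pruned loss.
    if batch_idx_train >= warm_step:
        s, p = simple_loss_scale, 1.0
    else:
        # Assumed ramp: simple 1.0 -> 0.5, pruned 0.1 -> 1.0 over warm_step batches.
        frac = batch_idx_train / warm_step
        s = 1.0 - frac * (1.0 - simple_loss_scale)
        p = 0.1 + 0.9 * frac
    return s * simple_loss + p * pruned_loss

# The tot_loss record above: 0.5 * 0.322 + 0.08465 ~= 0.2457
assert abs(combine_transducer_losses(0.322, 0.08465, 10**6) - 0.2457) < 1e-3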
2023-06-22 03:04:59,678 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.721e+02 3.071e+02 3.513e+02 5.338e+02, threshold=6.142e+02, percent-clipped=0.0 2023-06-22 03:05:34,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=887334.0, ans=0.125 2023-06-22 03:06:11,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=887394.0, ans=0.125 2023-06-22 03:06:56,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=887514.0, ans=0.125 2023-06-22 03:07:13,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=887514.0, ans=0.1 2023-06-22 03:07:18,014 INFO [train.py:996] (0/4) Epoch 5, batch 25950, loss[loss=0.2331, simple_loss=0.3057, pruned_loss=0.08018, over 20717.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3288, pruned_loss=0.08781, over 4279223.81 frames. ], batch size: 607, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:08:16,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=887694.0, ans=0.125 2023-06-22 03:08:53,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=887754.0, ans=0.0 2023-06-22 03:09:40,574 INFO [train.py:996] (0/4) Epoch 5, batch 26000, loss[loss=0.2653, simple_loss=0.3432, pruned_loss=0.09374, over 21652.00 frames. ], tot_loss[loss=0.249, simple_loss=0.327, pruned_loss=0.0855, over 4272482.95 frames. ], batch size: 263, lr: 6.00e-03, grad_scale: 32.0 2023-06-22 03:09:42,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.564e+02 2.907e+02 3.379e+02 5.318e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-22 03:10:05,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.40 vs. limit=15.0 2023-06-22 03:10:37,777 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-148000.pt 2023-06-22 03:11:10,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=888054.0, ans=0.0 2023-06-22 03:11:11,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=888054.0, ans=0.0 2023-06-22 03:11:29,132 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-22 03:11:39,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=888114.0, ans=0.0 2023-06-22 03:11:47,520 INFO [train.py:996] (0/4) Epoch 5, batch 26050, loss[loss=0.3125, simple_loss=0.4077, pruned_loss=0.1086, over 19766.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3265, pruned_loss=0.08623, over 4275006.17 frames.
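], batch size: 703, lr: 6.00e-03, grad_scale: 32.0

Each optim.py:471 record prints five quantiles (min, 25%, 50%, 75%, max) of recently observed gradient norms; with Clipping_scale=2.0 the clipping threshold is twice the running median (2.0 * 3.071e+02 = 6.142e+02 in the first such record above), and percent-clipped reports how often recent steps exceeded it. A rough sketch of that bookkeeping, assuming a plain window of per-step norms rather than ScaledAdam's actual internals:

import torch

def clipping_report(grad_norms, clipping_scale=2.0):
    # grad_norms: recent per-step gradient norms, e.g. a deque of floats.
    t = torch.tensor(list(grad_norms))
    q = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])).tolist()
    threshold = clipping_scale * q[2]  # 2x the running median
    pct = 100.0 * (t > threshold).float().mean().item()
    print(
        f"Clipping_scale={clipping_scale}, grad-norm quartiles "
        f"{q[0]:.3e} {q[1]:.3e} {q[2]:.3e} {q[3]:.3e} {q[4]:.3e}, "
        f"threshold={threshold:.3e}, percent-clipped={pct:.1f}"
    )
    return threshold  # gradients with norms above this get scaled down to it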
2023-06-22 03:12:26,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=888234.0, ans=0.1 2023-06-22 03:12:53,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=888294.0, ans=0.125 2023-06-22 03:13:05,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=888294.0, ans=0.2 2023-06-22 03:13:12,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=888354.0, ans=0.0 2023-06-22 03:13:34,319 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:13:41,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=888354.0, ans=0.025 2023-06-22 03:13:53,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=888414.0, ans=0.125 2023-06-22 03:13:58,829 INFO [train.py:996] (0/4) Epoch 5, batch 26100, loss[loss=0.2394, simple_loss=0.3058, pruned_loss=0.08649, over 21890.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3214, pruned_loss=0.08551, over 4282324.74 frames. ], batch size: 371, lr: 6.00e-03, grad_scale: 32.0 2023-06-22 03:14:00,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.634e+02 2.953e+02 3.214e+02 4.969e+02, threshold=5.905e+02, percent-clipped=0.0 2023-06-22 03:14:25,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=888534.0, ans=0.125 2023-06-22 03:15:01,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=888594.0, ans=0.125 2023-06-22 03:16:09,558 INFO [train.py:996] (0/4) Epoch 5, batch 26150, loss[loss=0.2581, simple_loss=0.3284, pruned_loss=0.09391, over 21935.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3188, pruned_loss=0.08628, over 4293087.98 frames. ], batch size: 372, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:16:40,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=888774.0, ans=0.0 2023-06-22 03:18:41,269 INFO [train.py:996] (0/4) Epoch 5, batch 26200, loss[loss=0.2283, simple_loss=0.3277, pruned_loss=0.06449, over 21658.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3193, pruned_loss=0.08488, over 4289068.61 frames. ], batch size: 263, lr: 6.00e-03, grad_scale: 16.0 2023-06-22 03:18:48,978 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.441e+02 2.915e+02 3.551e+02 5.779e+02, threshold=5.831e+02, percent-clipped=0.0 2023-06-22 03:19:26,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-22 03:19:50,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=889194.0, ans=0.125 2023-06-22 03:21:09,478 INFO [train.py:996] (0/4) Epoch 5, batch 26250, loss[loss=0.2467, simple_loss=0.3282, pruned_loss=0.08265, over 21758.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3216, pruned_loss=0.0832, over 4289702.75 frames.
], batch size: 112, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:21:17,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=889374.0, ans=0.0 2023-06-22 03:21:19,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-22 03:21:30,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=889434.0, ans=0.0 2023-06-22 03:21:57,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=889434.0, ans=0.125 2023-06-22 03:22:24,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=889494.0, ans=0.125 2023-06-22 03:22:39,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=889554.0, ans=0.125 2023-06-22 03:23:26,708 INFO [train.py:996] (0/4) Epoch 5, batch 26300, loss[loss=0.2374, simple_loss=0.3132, pruned_loss=0.0808, over 21439.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.319, pruned_loss=0.08395, over 4295139.05 frames. ], batch size: 131, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:23:37,830 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.568e+02 2.838e+02 3.219e+02 5.714e+02, threshold=5.676e+02, percent-clipped=0.0 2023-06-22 03:24:35,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=889794.0, ans=0.125 2023-06-22 03:24:43,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=889794.0, ans=0.0 2023-06-22 03:25:36,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=889914.0, ans=0.1 2023-06-22 03:25:53,849 INFO [train.py:996] (0/4) Epoch 5, batch 26350, loss[loss=0.3269, simple_loss=0.3738, pruned_loss=0.14, over 21346.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3174, pruned_loss=0.08433, over 4285415.17 frames. ], batch size: 507, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:26:20,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=890034.0, ans=0.0 2023-06-22 03:26:54,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0 2023-06-22 03:27:04,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=890094.0, ans=0.2 2023-06-22 03:27:04,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=890094.0, ans=0.1 2023-06-22 03:27:08,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=890154.0, ans=0.0 2023-06-22 03:27:53,343 INFO [train.py:996] (0/4) Epoch 5, batch 26400, loss[loss=0.1956, simple_loss=0.2566, pruned_loss=0.06728, over 21372.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3117, pruned_loss=0.08433, over 4279362.98 frames. 
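], batch size: 194, lr: 5.99e-03, grad_scale: 32.0

The scaling.py:182 records track ScheduledFloat values: scalar hyperparameters (dropout p, skip rates, balancer probs, scale_min) that follow a piecewise-linear schedule in the global batch count, which is why every record carries batch_count and the current ans. A minimal stand-in with the same behavior (icefall's real ScheduledFloat in scaling.py has more machinery; the breakpoints below are invented for illustration):

class PiecewiseLinearFloat:
    """A float interpolated linearly between (batch_count, value) breakpoints."""

    def __init__(self, *points):
        self.points = sorted(points)  # e.g. (0.0, 0.2), (20000.0, 0.0)

    def __call__(self, batch_count):
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# A skip-rate that decayed to 0.0 long ago, like the ans=0.0 records above:
ff3_skip_rate = PiecewiseLinearFloat((0.0, 0.2), (20000.0, 0.0))
assert ff3_skip_rate(890034.0) == 0.0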
2023-06-22 03:27:56,288 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.457e+02 2.775e+02 3.263e+02 6.072e+02, threshold=5.551e+02, percent-clipped=1.0 2023-06-22 03:28:03,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=890274.0, ans=0.125 2023-06-22 03:28:48,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=890334.0, ans=0.125 2023-06-22 03:29:01,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=890394.0, ans=0.0 2023-06-22 03:29:40,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=890514.0, ans=0.0 2023-06-22 03:29:43,733 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=22.5 2023-06-22 03:30:28,627 INFO [train.py:996] (0/4) Epoch 5, batch 26450, loss[loss=0.2594, simple_loss=0.3632, pruned_loss=0.07781, over 21719.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.31, pruned_loss=0.08376, over 4273555.97 frames. ], batch size: 332, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:31:51,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=890754.0, ans=0.0 2023-06-22 03:32:41,196 INFO [train.py:996] (0/4) Epoch 5, batch 26500, loss[loss=0.1645, simple_loss=0.2224, pruned_loss=0.05332, over 21838.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3118, pruned_loss=0.08201, over 4272808.46 frames. ], batch size: 107, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:32:44,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 2.731e+02 3.364e+02 4.078e+02 7.843e+02, threshold=6.728e+02, percent-clipped=9.0 2023-06-22 03:33:16,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=890934.0, ans=0.125 2023-06-22 03:33:19,618 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:33:59,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=890994.0, ans=0.125 2023-06-22 03:34:21,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=891054.0, ans=0.0 2023-06-22 03:35:11,484 INFO [train.py:996] (0/4) Epoch 5, batch 26550, loss[loss=0.2401, simple_loss=0.3347, pruned_loss=0.07276, over 21549.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3076, pruned_loss=0.07898, over 4263436.74 frames. ], batch size: 473, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:36:51,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=891354.0, ans=0.125 2023-06-22 03:37:09,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=891414.0, ans=0.0 2023-06-22 03:37:43,560 INFO [train.py:996] (0/4) Epoch 5, batch 26600, loss[loss=0.2197, simple_loss=0.2908, pruned_loss=0.07429, over 21144.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3068, pruned_loss=0.07588, over 4271671.03 frames.
], batch size: 548, lr: 5.99e-03, grad_scale: 32.0 2023-06-22 03:37:45,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=891474.0, ans=0.07 2023-06-22 03:37:46,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 2.463e+02 2.685e+02 3.046e+02 4.735e+02, threshold=5.371e+02, percent-clipped=0.0 2023-06-22 03:37:51,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=891474.0, ans=0.07 2023-06-22 03:38:39,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=891594.0, ans=0.07 2023-06-22 03:38:41,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=891594.0, ans=0.0 2023-06-22 03:39:19,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-22 03:39:43,218 INFO [train.py:996] (0/4) Epoch 5, batch 26650, loss[loss=0.2123, simple_loss=0.2895, pruned_loss=0.0676, over 21543.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3001, pruned_loss=0.07489, over 4260331.09 frames. ], batch size: 441, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:39:54,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=891774.0, ans=0.2 2023-06-22 03:41:53,005 INFO [train.py:996] (0/4) Epoch 5, batch 26700, loss[loss=0.2077, simple_loss=0.2759, pruned_loss=0.06974, over 21295.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2945, pruned_loss=0.07294, over 4262635.10 frames. ], batch size: 143, lr: 5.99e-03, grad_scale: 16.0 2023-06-22 03:41:53,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=892074.0, ans=0.125 2023-06-22 03:42:06,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 2.008e+02 2.325e+02 2.767e+02 5.375e+02, threshold=4.650e+02, percent-clipped=1.0 2023-06-22 03:42:16,769 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-22 03:43:00,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=892194.0, ans=0.125 2023-06-22 03:43:00,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=892194.0, ans=0.125 2023-06-22 03:43:02,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=892194.0, ans=0.125 2023-06-22 03:43:20,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=892254.0, ans=0.1 2023-06-22 03:43:23,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.98 vs. limit=6.0 2023-06-22 03:43:27,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=892254.0, ans=0.125 2023-06-22 03:44:15,905 INFO [train.py:996] (0/4) Epoch 5, batch 26750, loss[loss=0.2055, simple_loss=0.301, pruned_loss=0.05497, over 21759.00 frames. 
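], tot_loss[loss=0.2197, simple_loss=0.2951, pruned_loss=0.07218, over 4265707.64 frames. ], batch size: 351, lr: 5.98e-03, grad_scale: 8.0

The grad_scale field (8.0 here, 16.0 or 32.0 elsewhere) is the dynamic loss-scaling factor of the fp16 GradScaler enabled by use_fp16=True: it is halved whenever a step produces inf/nan gradients and grown back after a run of finite steps. The stock PyTorch pattern, sketched with a hypothetical model and batch (icefall wraps this in its own checks, but the mechanism is the standard one):

import torch
from torch.cuda.amp import GradScaler, autocast

def train_step(model, optimizer, scaler, batch):
    optimizer.zero_grad()
    with autocast(dtype=torch.float16):  # fp16 forward region
        loss = model(batch)              # hypothetical: returns a scalar loss
    scaler.scale(loss).backward()        # scale up so fp16 grads stay finite
    scaler.step(optimizer)               # unscales grads; skips step on inf/nan
    scaler.update()                      # halve on overflow, grow periodically
    return loss.detach(), scaler.get_scale()  # get_scale() is the logged grad_scale

scaler = GradScaler()  # starts at 65536.0 by default and adapts from there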
2023-06-22 03:45:01,949 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:45:57,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=892554.0, ans=0.125 2023-06-22 03:46:02,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=892554.0, ans=0.2 2023-06-22 03:46:45,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=15.0 2023-06-22 03:46:46,163 INFO [train.py:996] (0/4) Epoch 5, batch 26800, loss[loss=0.2793, simple_loss=0.3411, pruned_loss=0.1088, over 21364.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3038, pruned_loss=0.07724, over 4269297.32 frames. ], batch size: 548, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:46:52,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.542e+02 2.962e+02 3.442e+02 4.612e+02, threshold=5.925e+02, percent-clipped=0.0 2023-06-22 03:48:32,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=892854.0, ans=0.125 2023-06-22 03:48:48,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=892914.0, ans=0.2 2023-06-22 03:48:55,005 INFO [train.py:996] (0/4) Epoch 5, batch 26850, loss[loss=0.2021, simple_loss=0.2644, pruned_loss=0.06993, over 21535.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3063, pruned_loss=0.08, over 4274418.95 frames. ], batch size: 230, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:48:56,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=892974.0, ans=0.0 2023-06-22 03:49:55,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=893094.0, ans=0.125 2023-06-22 03:49:57,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.94 vs. limit=15.0 2023-06-22 03:50:44,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=893214.0, ans=0.2 2023-06-22 03:50:48,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=893214.0, ans=0.125 2023-06-22 03:51:10,811 INFO [train.py:996] (0/4) Epoch 5, batch 26900, loss[loss=0.2032, simple_loss=0.2616, pruned_loss=0.07238, over 21694.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2982, pruned_loss=0.07897, over 4270075.83 frames. ], batch size: 417, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:51:14,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=893274.0, ans=0.0 2023-06-22 03:51:15,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.56 vs.
limit=6.0 2023-06-22 03:51:22,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.626e+02 2.882e+02 3.361e+02 7.434e+02, threshold=5.764e+02, percent-clipped=1.0 2023-06-22 03:51:27,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=893274.0, ans=0.0 2023-06-22 03:51:30,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=893334.0, ans=0.125 2023-06-22 03:51:59,029 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:52:10,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=893394.0, ans=0.125 2023-06-22 03:53:11,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=893514.0, ans=0.0 2023-06-22 03:53:16,933 INFO [train.py:996] (0/4) Epoch 5, batch 26950, loss[loss=0.1964, simple_loss=0.2889, pruned_loss=0.05191, over 19802.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2979, pruned_loss=0.07892, over 4271767.71 frames. ], batch size: 702, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:54:24,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=893694.0, ans=0.125 2023-06-22 03:55:33,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2023-06-22 03:55:45,160 INFO [train.py:996] (0/4) Epoch 5, batch 27000, loss[loss=0.1915, simple_loss=0.2687, pruned_loss=0.05719, over 21517.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2986, pruned_loss=0.0765, over 4269250.92 frames. ], batch size: 195, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:55:45,161 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 03:56:32,881 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2499, simple_loss=0.3437, pruned_loss=0.07804, over 1796401.00 frames. 2023-06-22 03:56:32,882 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB 2023-06-22 03:56:34,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=893874.0, ans=0.0 2023-06-22 03:56:45,963 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.322e+02 2.675e+02 3.569e+02 6.901e+02, threshold=5.350e+02, percent-clipped=2.0 2023-06-22 03:57:16,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=893994.0, ans=0.0 2023-06-22 03:57:35,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=893994.0, ans=0.125 2023-06-22 03:58:47,702 INFO [train.py:996] (0/4) Epoch 5, batch 27050, loss[loss=0.2357, simple_loss=0.3106, pruned_loss=0.08039, over 21839.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3003, pruned_loss=0.07352, over 4266314.53 frames. ], batch size: 124, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 03:58:50,243 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-22 04:00:33,227 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. 
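limit=6.0

The validation pass logged above ('Computing validation loss' at batch 27000, then one loss over 1796401.00 frames) runs periodically on the dev loader; the reported number is a frame-weighted average over the whole dev set, which is why the frame count matches the epoch-1 validation earlier in this log. A schematic version, assuming a compute_loss helper that returns a summed loss and a frame count per batch:

import torch

@torch.no_grad()
def compute_validation_loss(model, valid_loader, compute_loss):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in valid_loader:
        loss_sum, num_frames = compute_loss(model, batch)  # hypothetical helper
        tot_loss += float(loss_sum)      # sum of losses over this batch's frames
        tot_frames += float(num_frames)  # supervision frames in this batch
    model.train()
    return tot_loss / tot_frames  # e.g. loss=0.2499 over 1796401.00 frames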
2023-06-22 04:00:52,686 INFO [train.py:996] (0/4) Epoch 5, batch 27100, loss[loss=0.2597, simple_loss=0.3469, pruned_loss=0.08629, over 21689.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3029, pruned_loss=0.07415, over 4274917.46 frames. ], batch size: 441, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 04:00:58,763 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 2.373e+02 2.647e+02 3.210e+02 5.471e+02, threshold=5.294e+02, percent-clipped=1.0 2023-06-22 04:01:17,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=894474.0, ans=0.125 2023-06-22 04:01:53,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=15.0 2023-06-22 04:02:03,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=894594.0, ans=0.125 2023-06-22 04:02:03,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-22 04:03:16,618 INFO [train.py:996] (0/4) Epoch 5, batch 27150, loss[loss=0.2844, simple_loss=0.3778, pruned_loss=0.0955, over 21642.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3121, pruned_loss=0.07699, over 4277615.76 frames. ], batch size: 389, lr: 5.98e-03, grad_scale: 16.0 2023-06-22 04:03:35,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.66 vs. limit=15.0 2023-06-22 04:04:00,211 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-22 04:04:24,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=894894.0, ans=0.125 2023-06-22 04:04:46,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=894954.0, ans=0.125 2023-06-22 04:05:41,569 INFO [train.py:996] (0/4) Epoch 5, batch 27200, loss[loss=0.267, simple_loss=0.3517, pruned_loss=0.09118, over 21460.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3183, pruned_loss=0.07964, over 4276474.39 frames. ], batch size: 131, lr: 5.98e-03, grad_scale: 32.0 2023-06-22 04:05:43,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=895074.0, ans=0.0 2023-06-22 04:05:46,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=895074.0, ans=0.125 2023-06-22 04:05:47,605 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.848e+02 3.372e+02 3.981e+02 6.685e+02, threshold=6.744e+02, percent-clipped=3.0 2023-06-22 04:06:01,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=895074.0, ans=0.015 2023-06-22 04:07:06,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-22 04:07:28,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs.
limit=6.0 2023-06-22 04:08:00,933 INFO [train.py:996] (0/4) Epoch 5, batch 27250, loss[loss=0.2699, simple_loss=0.3327, pruned_loss=0.1036, over 21390.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3225, pruned_loss=0.08367, over 4280351.24 frames. ], batch size: 549, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:08:58,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-22 04:09:25,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-22 04:09:34,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=895494.0, ans=0.125 2023-06-22 04:10:22,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-22 04:10:32,065 INFO [train.py:996] (0/4) Epoch 5, batch 27300, loss[loss=0.223, simple_loss=0.3112, pruned_loss=0.06738, over 21786.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.323, pruned_loss=0.08431, over 4278522.97 frames. ], batch size: 332, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:10:52,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.750e+02 3.166e+02 3.741e+02 6.824e+02, threshold=6.331e+02, percent-clipped=1.0 2023-06-22 04:12:00,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.13 vs. limit=5.0 2023-06-22 04:12:43,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-22 04:12:44,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=895914.0, ans=0.125 2023-06-22 04:13:20,062 INFO [train.py:996] (0/4) Epoch 5, batch 27350, loss[loss=0.2233, simple_loss=0.3071, pruned_loss=0.06979, over 21908.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3272, pruned_loss=0.08582, over 4276327.91 frames. ], batch size: 316, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:14:10,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-22 04:14:56,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=896154.0, ans=0.0 2023-06-22 04:15:27,416 INFO [train.py:996] (0/4) Epoch 5, batch 27400, loss[loss=0.2153, simple_loss=0.281, pruned_loss=0.0748, over 21682.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3221, pruned_loss=0.0853, over 4280768.23 frames. 
], batch size: 230, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:15:45,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.543e+02 2.846e+02 3.223e+02 4.913e+02, threshold=5.692e+02, percent-clipped=0.0 2023-06-22 04:15:50,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=896274.0, ans=0.1 2023-06-22 04:16:41,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=896394.0, ans=0.1 2023-06-22 04:17:00,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-22 04:17:49,326 INFO [train.py:996] (0/4) Epoch 5, batch 27450, loss[loss=0.24, simple_loss=0.3264, pruned_loss=0.07684, over 21733.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3169, pruned_loss=0.08359, over 4274576.86 frames. ], batch size: 351, lr: 5.97e-03, grad_scale: 16.0 2023-06-22 04:18:26,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=896634.0, ans=0.0 2023-06-22 04:18:40,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=896694.0, ans=0.0 2023-06-22 04:18:47,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=896694.0, ans=0.2 2023-06-22 04:19:23,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=896754.0, ans=0.125 2023-06-22 04:20:11,228 INFO [train.py:996] (0/4) Epoch 5, batch 27500, loss[loss=0.2294, simple_loss=0.2956, pruned_loss=0.08163, over 21873.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3153, pruned_loss=0.08408, over 4277295.25 frames. ], batch size: 298, lr: 5.97e-03, grad_scale: 16.0 2023-06-22 04:20:18,389 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.586e+02 3.048e+02 3.349e+02 5.042e+02, threshold=6.096e+02, percent-clipped=0.0 2023-06-22 04:20:57,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=896934.0, ans=0.1 2023-06-22 04:20:59,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=896994.0, ans=0.0 2023-06-22 04:21:33,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=897054.0, ans=0.1 2023-06-22 04:21:47,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=897054.0, ans=0.125 2023-06-22 04:22:17,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=897174.0, ans=0.0 2023-06-22 04:22:18,497 INFO [train.py:996] (0/4) Epoch 5, batch 27550, loss[loss=0.303, simple_loss=0.4018, pruned_loss=0.1021, over 19858.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3114, pruned_loss=0.08095, over 4272788.27 frames. 
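], batch size: 230, lr: 5.97e-03, grad_scale: 32.0

The lr field creeps down across these records (6.02e-03 at the top of this stretch, 5.97e-03 here) following the Eden schedule configured by base_lr=0.045, lr_batches=7500 and lr_epochs=1.5, which discounts the base rate by both the batch index and the epoch count. A sketch of the schedule as defined in icefall's optim.py (warmup factor omitted; the sample batch/epoch values below are estimates, not read from the log):

def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.5):
    # Eden: a smooth ~1/sqrt-style decay in both batch index and epoch.
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Roughly where this log sits (just past checkpoint-148000, four epochs done):
print(f"{eden_lr(0.045, batch=149000, epoch=4):.2e}")  # ~5.98e-03 vs. logged 5.97e-03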
], batch size: 702, lr: 5.97e-03, grad_scale: 16.0 2023-06-22 04:22:18,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=897174.0, ans=0.125 2023-06-22 04:22:20,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=897174.0, ans=0.0 2023-06-22 04:22:27,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-22 04:23:12,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=897294.0, ans=0.0 2023-06-22 04:23:56,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.99 vs. limit=12.0 2023-06-22 04:24:27,659 INFO [train.py:996] (0/4) Epoch 5, batch 27600, loss[loss=0.2127, simple_loss=0.2786, pruned_loss=0.07337, over 21202.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3039, pruned_loss=0.07935, over 4275844.81 frames. ], batch size: 176, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:24:46,034 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.414e+02 2.661e+02 3.209e+02 4.551e+02, threshold=5.321e+02, percent-clipped=0.0 2023-06-22 04:25:06,381 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:26:14,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=897714.0, ans=0.95 2023-06-22 04:26:32,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=15.0 2023-06-22 04:26:34,747 INFO [train.py:996] (0/4) Epoch 5, batch 27650, loss[loss=0.2241, simple_loss=0.2881, pruned_loss=0.08009, over 21672.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2983, pruned_loss=0.07845, over 4279581.38 frames. ], batch size: 391, lr: 5.97e-03, grad_scale: 32.0 2023-06-22 04:26:51,716 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-06-22 04:27:05,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=897834.0, ans=0.125 2023-06-22 04:27:34,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0 2023-06-22 04:28:17,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=897954.0, ans=0.0 2023-06-22 04:28:50,435 INFO [train.py:996] (0/4) Epoch 5, batch 27700, loss[loss=0.2015, simple_loss=0.2767, pruned_loss=0.06317, over 21274.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2985, pruned_loss=0.07686, over 4279286.18 frames. 
], batch size: 159, lr: 5.97e-03, grad_scale: 32.0
2023-06-22 04:28:59,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=898074.0, ans=0.1
2023-06-22 04:29:15,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.480e+02 2.768e+02 3.290e+02 5.245e+02, threshold=5.535e+02, percent-clipped=0.0
2023-06-22 04:30:07,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=898194.0, ans=0.0
2023-06-22 04:31:04,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=898314.0, ans=0.2
2023-06-22 04:31:08,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=22.5
2023-06-22 04:31:17,205 INFO [train.py:996] (0/4) Epoch 5, batch 27750, loss[loss=0.3091, simple_loss=0.381, pruned_loss=0.1186, over 21564.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3007, pruned_loss=0.07636, over 4268806.14 frames. ], batch size: 473, lr: 5.96e-03, grad_scale: 16.0
2023-06-22 04:31:38,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=898374.0, ans=0.1
2023-06-22 04:32:00,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=898434.0, ans=0.1
2023-06-22 04:32:00,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=898434.0, ans=0.09899494936611666
2023-06-22 04:32:02,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=898494.0, ans=0.125
2023-06-22 04:32:22,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=898494.0, ans=0.2
2023-06-22 04:32:30,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=898494.0, ans=0.125
2023-06-22 04:32:31,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=898494.0, ans=0.125
2023-06-22 04:33:27,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=898614.0, ans=0.1
2023-06-22 04:33:33,551 INFO [train.py:996] (0/4) Epoch 5, batch 27800, loss[loss=0.22, simple_loss=0.2857, pruned_loss=0.07719, over 21443.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3001, pruned_loss=0.07697, over 4280292.69 frames. ], batch size: 159, lr: 5.96e-03, grad_scale: 16.0
2023-06-22 04:33:41,765 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.543e+02 3.042e+02 3.729e+02 6.528e+02, threshold=6.084e+02, percent-clipped=2.0
2023-06-22 04:34:51,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=898794.0, ans=0.125
2023-06-22 04:34:53,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0
2023-06-22 04:35:03,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=898854.0, ans=0.125
2023-06-22 04:35:03,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=898854.0, ans=0.125
2023-06-22 04:35:43,847 INFO [train.py:996] (0/4) Epoch 5, batch 27850, loss[loss=0.2626, simple_loss=0.3446, pruned_loss=0.09029, over 21773.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3001, pruned_loss=0.07845, over 4285894.29 frames. ], batch size: 414, lr: 5.96e-03, grad_scale: 16.0
2023-06-22 04:35:47,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=898974.0, ans=0.1
2023-06-22 04:36:00,724 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 04:36:24,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=899034.0, ans=0.0
2023-06-22 04:37:01,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0
2023-06-22 04:37:07,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0
2023-06-22 04:37:33,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0
2023-06-22 04:38:27,002 INFO [train.py:996] (0/4) Epoch 5, batch 27900, loss[loss=0.2716, simple_loss=0.3846, pruned_loss=0.07928, over 20856.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3091, pruned_loss=0.07939, over 4282013.57 frames. ], batch size: 607, lr: 5.96e-03, grad_scale: 16.0
2023-06-22 04:38:46,423 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.475e+02 2.737e+02 3.192e+02 5.724e+02, threshold=5.474e+02, percent-clipped=0.0
2023-06-22 04:38:48,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=899274.0, ans=0.0
2023-06-22 04:39:13,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=899334.0, ans=0.125
2023-06-22 04:40:51,508 INFO [train.py:996] (0/4) Epoch 5, batch 27950, loss[loss=0.2114, simple_loss=0.3045, pruned_loss=0.05917, over 21870.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.31, pruned_loss=0.07676, over 4284505.81 frames. ], batch size: 316, lr: 5.96e-03, grad_scale: 16.0
2023-06-22 04:41:34,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0
2023-06-22 04:43:07,350 INFO [train.py:996] (0/4) Epoch 5, batch 28000, loss[loss=0.2118, simple_loss=0.3018, pruned_loss=0.06087, over 21349.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3077, pruned_loss=0.07449, over 4281123.22 frames. ], batch size: 548, lr: 5.96e-03, grad_scale: 32.0
2023-06-22 04:43:31,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.728e+02 2.277e+02 2.720e+02 3.265e+02 5.503e+02, threshold=5.441e+02, percent-clipped=1.0
2023-06-22 04:43:58,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=899934.0, ans=0.0
2023-06-22 04:44:16,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0
2023-06-22 04:44:41,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=900054.0, ans=0.09899494936611666
2023-06-22 04:45:24,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=900114.0, ans=0.125
2023-06-22 04:45:36,030 INFO [train.py:996] (0/4) Epoch 5, batch 28050, loss[loss=0.1956, simple_loss=0.2582, pruned_loss=0.06653, over 21463.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3046, pruned_loss=0.07523, over 4288220.50 frames. ], batch size: 211, lr: 5.96e-03, grad_scale: 32.0
2023-06-22 04:46:31,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=900234.0, ans=0.125
2023-06-22 04:46:41,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=900294.0, ans=0.125
2023-06-22 04:46:50,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0
2023-06-22 04:47:08,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=900354.0, ans=0.1
2023-06-22 04:47:57,171 INFO [train.py:996] (0/4) Epoch 5, batch 28100, loss[loss=0.2196, simple_loss=0.2789, pruned_loss=0.08016, over 21496.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3009, pruned_loss=0.07485, over 4271896.43 frames. ], batch size: 441, lr: 5.96e-03, grad_scale: 16.0
2023-06-22 04:48:00,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=900474.0, ans=0.0
2023-06-22 04:48:16,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.684e+02 3.220e+02 3.866e+02 7.727e+02, threshold=6.440e+02, percent-clipped=6.0
2023-06-22 04:49:45,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=900714.0, ans=0.2
2023-06-22 04:49:47,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=900714.0, ans=0.125
2023-06-22 04:50:01,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=900714.0, ans=0.5
2023-06-22 04:50:04,228 INFO [train.py:996] (0/4) Epoch 5, batch 28150, loss[loss=0.2667, simple_loss=0.4046, pruned_loss=0.06438, over 19760.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2954, pruned_loss=0.07465, over 4273831.26 frames. ], batch size: 702, lr: 5.96e-03, grad_scale: 16.0
2023-06-22 04:50:33,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=900834.0, ans=0.125
2023-06-22 04:52:25,417 INFO [train.py:996] (0/4) Epoch 5, batch 28200, loss[loss=0.2376, simple_loss=0.3036, pruned_loss=0.08578, over 21758.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2929, pruned_loss=0.07591, over 4276779.75 frames. ], batch size: 247, lr: 5.96e-03, grad_scale: 16.0
2023-06-22 04:52:25,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=901074.0, ans=0.2
2023-06-22 04:52:35,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.760e+02 3.168e+02 3.960e+02 6.976e+02, threshold=6.335e+02, percent-clipped=2.0
2023-06-22 04:52:47,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=901134.0, ans=0.1
2023-06-22 04:53:58,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=901254.0, ans=0.1
2023-06-22 04:54:20,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=901314.0, ans=0.125
2023-06-22 04:54:24,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=901314.0, ans=0.125
2023-06-22 04:54:34,810 INFO [train.py:996] (0/4) Epoch 5, batch 28250, loss[loss=0.2296, simple_loss=0.284, pruned_loss=0.08762, over 21203.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2967, pruned_loss=0.07916, over 4272423.86 frames. ], batch size: 159, lr: 5.95e-03, grad_scale: 16.0
2023-06-22 04:54:35,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=901374.0, ans=0.0
2023-06-22 04:54:56,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=901374.0, ans=0.125
2023-06-22 04:55:11,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=901434.0, ans=0.125
2023-06-22 04:56:35,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=901614.0, ans=0.0
2023-06-22 04:56:53,964 INFO [train.py:996] (0/4) Epoch 5, batch 28300, loss[loss=0.1864, simple_loss=0.2777, pruned_loss=0.04761, over 21574.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2948, pruned_loss=0.07718, over 4266023.65 frames. ], batch size: 230, lr: 5.95e-03, grad_scale: 16.0
2023-06-22 04:57:26,484 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.516e+02 2.928e+02 3.510e+02 5.631e+02, threshold=5.856e+02, percent-clipped=0.0
2023-06-22 04:57:31,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=901734.0, ans=0.95
2023-06-22 04:57:38,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=901734.0, ans=0.125
2023-06-22 04:59:05,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5
2023-06-22 04:59:26,096 INFO [train.py:996] (0/4) Epoch 5, batch 28350, loss[loss=0.1962, simple_loss=0.2591, pruned_loss=0.06669, over 21792.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2929, pruned_loss=0.07225, over 4271042.89 frames. ], batch size: 124, lr: 5.95e-03, grad_scale: 16.0
2023-06-22 05:01:31,922 INFO [train.py:996] (0/4) Epoch 5, batch 28400, loss[loss=0.2816, simple_loss=0.3419, pruned_loss=0.1107, over 21382.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2905, pruned_loss=0.07239, over 4269307.69 frames. ], batch size: 471, lr: 5.95e-03, grad_scale: 32.0
2023-06-22 05:02:02,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.345e+02 2.616e+02 3.191e+02 6.472e+02, threshold=5.233e+02, percent-clipped=2.0
2023-06-22 05:02:14,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=902334.0, ans=0.035
2023-06-22 05:02:24,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=902334.0, ans=0.125
2023-06-22 05:02:31,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=902394.0, ans=0.04949747468305833
2023-06-22 05:03:26,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=902454.0, ans=0.2
2023-06-22 05:03:56,942 INFO [train.py:996] (0/4) Epoch 5, batch 28450, loss[loss=0.2374, simple_loss=0.3073, pruned_loss=0.08369, over 21423.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2957, pruned_loss=0.07547, over 4261015.48 frames. ], batch size: 194, lr: 5.95e-03, grad_scale: 16.0
2023-06-22 05:04:35,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=902634.0, ans=0.125
2023-06-22 05:05:02,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0
2023-06-22 05:05:03,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=902694.0, ans=0.0
2023-06-22 05:05:58,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=902814.0, ans=0.0
2023-06-22 05:06:06,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=902814.0, ans=0.125
2023-06-22 05:06:06,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=902814.0, ans=0.0
2023-06-22 05:06:25,819 INFO [train.py:996] (0/4) Epoch 5, batch 28500, loss[loss=0.2333, simple_loss=0.3053, pruned_loss=0.08067, over 21596.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2971, pruned_loss=0.07715, over 4267931.41 frames. ], batch size: 263, lr: 5.95e-03, grad_scale: 16.0
2023-06-22 05:06:29,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=902874.0, ans=0.2
2023-06-22 05:06:38,318 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.555e+02 2.878e+02 3.269e+02 4.287e+02, threshold=5.756e+02, percent-clipped=0.0
2023-06-22 05:06:40,978 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0
2023-06-22 05:06:48,539 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.27 vs. limit=15.0
2023-06-22 05:07:04,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0
2023-06-22 05:08:46,878 INFO [train.py:996] (0/4) Epoch 5, batch 28550, loss[loss=0.2322, simple_loss=0.3065, pruned_loss=0.079, over 20741.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3057, pruned_loss=0.08072, over 4272617.45 frames. ], batch size: 608, lr: 5.95e-03, grad_scale: 16.0
2023-06-22 05:09:52,973 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=12.0
2023-06-22 05:10:34,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=903414.0, ans=0.2
2023-06-22 05:11:02,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=903414.0, ans=0.125
2023-06-22 05:11:12,575 INFO [train.py:996] (0/4) Epoch 5, batch 28600, loss[loss=0.3273, simple_loss=0.3763, pruned_loss=0.1392, over 21325.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3127, pruned_loss=0.08302, over 4272497.47 frames. ], batch size: 507, lr: 5.95e-03, grad_scale: 16.0
2023-06-22 05:11:24,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.734e+02 3.060e+02 3.584e+02 6.352e+02, threshold=6.121e+02, percent-clipped=1.0
2023-06-22 05:11:26,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=903534.0, ans=0.125
2023-06-22 05:11:28,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=903534.0, ans=0.125
2023-06-22 05:11:43,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=903534.0, ans=0.125
2023-06-22 05:11:59,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5
2023-06-22 05:12:11,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=903594.0, ans=0.0
2023-06-22 05:13:22,527 INFO [train.py:996] (0/4) Epoch 5, batch 28650, loss[loss=0.225, simple_loss=0.2803, pruned_loss=0.08487, over 21510.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.307, pruned_loss=0.08222, over 4276237.60 frames. ], batch size: 391, lr: 5.95e-03, grad_scale: 16.0
2023-06-22 05:15:43,976 INFO [train.py:996] (0/4) Epoch 5, batch 28700, loss[loss=0.2021, simple_loss=0.2344, pruned_loss=0.08484, over 20085.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3051, pruned_loss=0.08313, over 4276557.54 frames. ], batch size: 704, lr: 5.95e-03, grad_scale: 16.0
2023-06-22 05:15:51,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=904074.0, ans=0.1
2023-06-22 05:16:01,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.516e+02 2.747e+02 3.102e+02 4.785e+02, threshold=5.493e+02, percent-clipped=0.0
2023-06-22 05:16:01,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=904074.0, ans=0.2
2023-06-22 05:16:03,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=904134.0, ans=0.125
2023-06-22 05:16:47,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=904194.0, ans=0.2
2023-06-22 05:18:02,998 INFO [train.py:996] (0/4) Epoch 5, batch 28750, loss[loss=0.231, simple_loss=0.31, pruned_loss=0.07603, over 21799.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3061, pruned_loss=0.08376, over 4279599.42 frames. ], batch size: 414, lr: 5.94e-03, grad_scale: 16.0
2023-06-22 05:18:55,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=904434.0, ans=0.125
2023-06-22 05:20:32,365 INFO [train.py:996] (0/4) Epoch 5, batch 28800, loss[loss=0.3217, simple_loss=0.3703, pruned_loss=0.1365, over 21437.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3098, pruned_loss=0.08418, over 4279137.66 frames. ], batch size: 471, lr: 5.94e-03, grad_scale: 32.0
2023-06-22 05:20:33,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0
2023-06-22 05:20:50,084 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.736e+02 3.061e+02 3.498e+02 5.651e+02, threshold=6.121e+02, percent-clipped=1.0
2023-06-22 05:21:23,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=904794.0, ans=0.2
2023-06-22 05:22:19,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=904914.0, ans=0.125
2023-06-22 05:22:27,834 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0
2023-06-22 05:23:02,430 INFO [train.py:996] (0/4) Epoch 5, batch 28850, loss[loss=0.2596, simple_loss=0.3115, pruned_loss=0.1039, over 21820.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3116, pruned_loss=0.08554, over 4283734.59 frames. ], batch size: 441, lr: 5.94e-03, grad_scale: 32.0
2023-06-22 05:23:47,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=905094.0, ans=0.125
2023-06-22 05:25:28,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=905214.0, ans=0.125
2023-06-22 05:25:42,890 INFO [train.py:996] (0/4) Epoch 5, batch 28900, loss[loss=0.232, simple_loss=0.2945, pruned_loss=0.08475, over 21313.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3135, pruned_loss=0.08726, over 4285976.53 frames. ], batch size: 176, lr: 5.94e-03, grad_scale: 32.0
2023-06-22 05:25:48,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=905274.0, ans=0.035
2023-06-22 05:25:51,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=905274.0, ans=0.1
2023-06-22 05:25:53,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=905274.0, ans=0.1
2023-06-22 05:25:55,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.574e+02 2.990e+02 3.468e+02 6.193e+02, threshold=5.980e+02, percent-clipped=0.0
2023-06-22 05:26:45,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=905394.0, ans=0.125
2023-06-22 05:27:45,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=905514.0, ans=0.125
2023-06-22 05:27:54,372 INFO [train.py:996] (0/4) Epoch 5, batch 28950, loss[loss=0.197, simple_loss=0.2633, pruned_loss=0.06534, over 21274.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3135, pruned_loss=0.08626, over 4281170.90 frames. ], batch size: 176, lr: 5.94e-03, grad_scale: 32.0
2023-06-22 05:28:09,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=905574.0, ans=0.125
2023-06-22 05:28:33,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=905634.0, ans=0.125
2023-06-22 05:28:40,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5
2023-06-22 05:30:22,096 INFO [train.py:996] (0/4) Epoch 5, batch 29000, loss[loss=0.2761, simple_loss=0.3511, pruned_loss=0.1006, over 21787.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3179, pruned_loss=0.08518, over 4279122.82 frames. ], batch size: 124, lr: 5.94e-03, grad_scale: 32.0
2023-06-22 05:30:46,532 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.550e+02 2.882e+02 3.345e+02 5.949e+02, threshold=5.765e+02, percent-clipped=1.0
2023-06-22 05:30:55,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=905934.0, ans=0.0
2023-06-22 05:32:11,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=906054.0, ans=0.125
2023-06-22 05:32:22,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=906114.0, ans=0.04949747468305833
2023-06-22 05:32:44,720 INFO [train.py:996] (0/4) Epoch 5, batch 29050, loss[loss=0.2498, simple_loss=0.3144, pruned_loss=0.09258, over 21811.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3162, pruned_loss=0.08569, over 4278624.91 frames. ], batch size: 441, lr: 5.94e-03, grad_scale: 32.0
2023-06-22 05:34:34,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=906414.0, ans=0.1
2023-06-22 05:34:59,591 INFO [train.py:996] (0/4) Epoch 5, batch 29100, loss[loss=0.1834, simple_loss=0.2551, pruned_loss=0.05587, over 21493.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3073, pruned_loss=0.0824, over 4269017.33 frames. ], batch size: 132, lr: 5.94e-03, grad_scale: 32.0
2023-06-22 05:35:22,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=906474.0, ans=0.2
2023-06-22 05:35:37,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.541e+02 2.908e+02 3.289e+02 5.672e+02, threshold=5.815e+02, percent-clipped=0.0
2023-06-22 05:36:26,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=906654.0, ans=0.0
2023-06-22 05:36:57,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=906714.0, ans=0.0
2023-06-22 05:37:19,650 INFO [train.py:996] (0/4) Epoch 5, batch 29150, loss[loss=0.2537, simple_loss=0.3585, pruned_loss=0.07449, over 20738.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.308, pruned_loss=0.08133, over 4260760.37 frames. ], batch size: 607, lr: 5.94e-03, grad_scale: 16.0
2023-06-22 05:37:44,830 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=22.5
2023-06-22 05:38:18,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=906894.0, ans=0.0
2023-06-22 05:38:22,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0
2023-06-22 05:39:18,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.12 vs. limit=22.5
2023-06-22 05:39:23,824 INFO [train.py:996] (0/4) Epoch 5, batch 29200, loss[loss=0.1946, simple_loss=0.2628, pruned_loss=0.0632, over 21550.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3039, pruned_loss=0.08065, over 4252362.88 frames. ], batch size: 247, lr: 5.94e-03, grad_scale: 32.0
2023-06-22 05:40:04,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.662e+02 3.109e+02 4.025e+02 6.896e+02, threshold=6.219e+02, percent-clipped=4.0
2023-06-22 05:40:33,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=907194.0, ans=0.0
2023-06-22 05:40:36,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=907194.0, ans=0.125
2023-06-22 05:40:42,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=907194.0, ans=0.1
2023-06-22 05:41:24,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=907314.0, ans=0.1
2023-06-22 05:41:28,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=907314.0, ans=0.1
2023-06-22 05:41:49,365 INFO [train.py:996] (0/4) Epoch 5, batch 29250, loss[loss=0.1894, simple_loss=0.2681, pruned_loss=0.05536, over 21174.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3013, pruned_loss=0.07835, over 4244344.66 frames. ], batch size: 176, lr: 5.93e-03, grad_scale: 32.0
2023-06-22 05:42:07,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=907374.0, ans=0.125
2023-06-22 05:42:17,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=907434.0, ans=0.125
2023-06-22 05:42:37,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=907494.0, ans=0.0
2023-06-22 05:42:59,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=907494.0, ans=0.1
2023-06-22 05:43:26,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=907554.0, ans=0.125
2023-06-22 05:44:02,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=907614.0, ans=0.0
2023-06-22 05:44:06,196 INFO [train.py:996] (0/4) Epoch 5, batch 29300, loss[loss=0.211, simple_loss=0.2732, pruned_loss=0.07436, over 19981.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3025, pruned_loss=0.07711, over 4254917.70 frames. ], batch size: 703, lr: 5.93e-03, grad_scale: 32.0
2023-06-22 05:44:37,885 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.436e+02 2.713e+02 3.173e+02 4.858e+02, threshold=5.427e+02, percent-clipped=0.0
2023-06-22 05:45:21,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0
2023-06-22 05:46:14,531 INFO [train.py:996] (0/4) Epoch 5, batch 29350, loss[loss=0.2169, simple_loss=0.2756, pruned_loss=0.07907, over 21157.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2975, pruned_loss=0.07631, over 4247320.21 frames. ], batch size: 176, lr: 5.93e-03, grad_scale: 32.0
2023-06-22 05:46:39,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=907974.0, ans=0.0
2023-06-22 05:46:54,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=908034.0, ans=0.125
2023-06-22 05:47:35,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0
2023-06-22 05:47:43,642 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0
2023-06-22 05:48:48,312 INFO [train.py:996] (0/4) Epoch 5, batch 29400, loss[loss=0.1919, simple_loss=0.269, pruned_loss=0.05737, over 21712.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2979, pruned_loss=0.07451, over 4252903.28 frames. ], batch size: 298, lr: 5.93e-03, grad_scale: 32.0
2023-06-22 05:48:51,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=908274.0, ans=0.125
2023-06-22 05:49:08,059 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.428e+02 2.742e+02 3.233e+02 5.582e+02, threshold=5.484e+02, percent-clipped=1.0
2023-06-22 05:49:41,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=908394.0, ans=0.125
2023-06-22 05:49:53,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=908394.0, ans=0.2
2023-06-22 05:50:08,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0
2023-06-22 05:50:58,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=908514.0, ans=0.0
2023-06-22 05:51:06,630 INFO [train.py:996] (0/4) Epoch 5, batch 29450, loss[loss=0.2329, simple_loss=0.3078, pruned_loss=0.07897, over 20765.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2969, pruned_loss=0.07418, over 4251469.22 frames. ], batch size: 609, lr: 5.93e-03, grad_scale: 32.0
2023-06-22 05:51:37,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=22.5
2023-06-22 05:51:39,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=908634.0, ans=0.125
2023-06-22 05:52:16,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=908694.0, ans=0.125
2023-06-22 05:52:21,690 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.80 vs. limit=10.0
2023-06-22 05:52:39,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=908754.0, ans=0.95
2023-06-22 05:52:51,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=908814.0, ans=0.125
2023-06-22 05:53:20,917 INFO [train.py:996] (0/4) Epoch 5, batch 29500, loss[loss=0.2277, simple_loss=0.2903, pruned_loss=0.08254, over 21938.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3029, pruned_loss=0.07783, over 4262592.33 frames. ], batch size: 333, lr: 5.93e-03, grad_scale: 32.0
2023-06-22 05:53:22,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=908874.0, ans=0.2
2023-06-22 05:53:27,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=908874.0, ans=0.02
2023-06-22 05:53:32,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=908874.0, ans=0.2
2023-06-22 05:53:33,704 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.10 vs. limit=15.0
2023-06-22 05:53:33,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.755e+02 3.048e+02 3.720e+02 7.452e+02, threshold=6.096e+02, percent-clipped=3.0
2023-06-22 05:54:09,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=908994.0, ans=0.125
2023-06-22 05:54:36,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=908994.0, ans=0.1
2023-06-22 05:55:13,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=909114.0, ans=0.125
2023-06-22 05:55:30,456 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0
2023-06-22 05:55:39,408 INFO [train.py:996] (0/4) Epoch 5, batch 29550, loss[loss=0.2285, simple_loss=0.2996, pruned_loss=0.07872, over 21934.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3026, pruned_loss=0.08003, over 4276507.63 frames. ], batch size: 351, lr: 5.93e-03, grad_scale: 16.0
2023-06-22 05:57:33,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=909354.0, ans=0.125
2023-06-22 05:58:01,445 INFO [train.py:996] (0/4) Epoch 5, batch 29600, loss[loss=0.2587, simple_loss=0.3484, pruned_loss=0.08452, over 21819.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3097, pruned_loss=0.08321, over 4279253.56 frames. ], batch size: 282, lr: 5.93e-03, grad_scale: 32.0
2023-06-22 05:58:29,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.694e+02 3.024e+02 3.458e+02 5.010e+02, threshold=6.047e+02, percent-clipped=0.0
2023-06-22 05:59:08,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=909594.0, ans=0.1
2023-06-22 05:59:26,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=909654.0, ans=0.125
2023-06-22 05:59:53,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=909654.0, ans=0.0
2023-06-22 06:00:26,805 INFO [train.py:996] (0/4) Epoch 5, batch 29650, loss[loss=0.2541, simple_loss=0.3164, pruned_loss=0.0959, over 21713.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3068, pruned_loss=0.07969, over 4286702.23 frames. ], batch size: 441, lr: 5.93e-03, grad_scale: 32.0
2023-06-22 06:00:43,077 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5
2023-06-22 06:00:49,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=909834.0, ans=0.125
2023-06-22 06:00:59,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=909834.0, ans=0.0
2023-06-22 06:01:26,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=909894.0, ans=0.125
2023-06-22 06:02:24,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0
2023-06-22 06:02:32,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0
2023-06-22 06:02:32,784 INFO [train.py:996] (0/4) Epoch 5, batch 29700, loss[loss=0.2801, simple_loss=0.3846, pruned_loss=0.08781, over 21698.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3084, pruned_loss=0.07968, over 4283232.47 frames. ], batch size: 389, lr: 5.93e-03, grad_scale: 32.0
2023-06-22 06:03:02,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.342e+02 2.589e+02 3.104e+02 5.027e+02, threshold=5.177e+02, percent-clipped=0.0
2023-06-22 06:03:44,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=910194.0, ans=0.0
2023-06-22 06:03:59,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=910254.0, ans=0.125
2023-06-22 06:04:29,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=910314.0, ans=0.0
2023-06-22 06:04:38,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=910314.0, ans=0.1
2023-06-22 06:04:46,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.61 vs. limit=10.0
2023-06-22 06:04:48,301 INFO [train.py:996] (0/4) Epoch 5, batch 29750, loss[loss=0.2816, simple_loss=0.3705, pruned_loss=0.09638, over 21704.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3146, pruned_loss=0.08029, over 4277698.51 frames. ], batch size: 441, lr: 5.92e-03, grad_scale: 32.0
2023-06-22 06:05:16,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.68 vs. limit=22.5
2023-06-22 06:05:17,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=910434.0, ans=0.025
2023-06-22 06:05:21,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=910434.0, ans=0.1
2023-06-22 06:05:52,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0
2023-06-22 06:05:53,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=910494.0, ans=0.0
2023-06-22 06:06:24,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=910554.0, ans=0.125
2023-06-22 06:06:48,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=910614.0, ans=0.0
2023-06-22 06:07:10,296 INFO [train.py:996] (0/4) Epoch 5, batch 29800, loss[loss=0.2784, simple_loss=0.3295, pruned_loss=0.1137, over 21637.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3142, pruned_loss=0.08013, over 4276962.59 frames. ], batch size: 471, lr: 5.92e-03, grad_scale: 32.0
2023-06-22 06:07:39,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=910734.0, ans=0.1
2023-06-22 06:07:40,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.449e+02 2.699e+02 3.078e+02 3.997e+02, threshold=5.399e+02, percent-clipped=0.0
2023-06-22 06:07:44,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=910734.0, ans=0.0
2023-06-22 06:07:46,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=910734.0, ans=0.125
2023-06-22 06:09:21,742 INFO [train.py:996] (0/4) Epoch 5, batch 29850, loss[loss=0.2078, simple_loss=0.2845, pruned_loss=0.06555, over 21562.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3085, pruned_loss=0.07662, over 4273758.96 frames. ], batch size: 212, lr: 5.92e-03, grad_scale: 32.0
2023-06-22 06:09:57,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=911034.0, ans=0.04949747468305833
2023-06-22 06:10:06,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=911034.0, ans=0.09899494936611666
2023-06-22 06:10:14,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=911034.0, ans=0.125
2023-06-22 06:10:29,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=911094.0, ans=0.0
2023-06-22 06:11:46,348 INFO [train.py:996] (0/4) Epoch 5, batch 29900, loss[loss=0.296, simple_loss=0.3494, pruned_loss=0.1213, over 21480.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3064, pruned_loss=0.07799, over 4284405.24 frames. ], batch size: 471, lr: 5.92e-03, grad_scale: 32.0
2023-06-22 06:12:18,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.539e+02 3.015e+02 3.692e+02 5.996e+02, threshold=6.029e+02, percent-clipped=2.0
2023-06-22 06:13:30,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0
2023-06-22 06:13:49,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=911514.0, ans=0.0
2023-06-22 06:14:06,407 INFO [train.py:996] (0/4) Epoch 5, batch 29950, loss[loss=0.2618, simple_loss=0.3272, pruned_loss=0.09822, over 21348.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3102, pruned_loss=0.08127, over 4283403.13 frames. ], batch size: 549, lr: 5.92e-03, grad_scale: 32.0
2023-06-22 06:15:08,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=911634.0, ans=10.0
2023-06-22 06:15:56,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=911754.0, ans=0.0
2023-06-22 06:15:58,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=911754.0, ans=0.05
2023-06-22 06:16:09,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=911814.0, ans=0.125
2023-06-22 06:16:10,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=911814.0, ans=0.1
2023-06-22 06:16:33,838 INFO [train.py:996] (0/4) Epoch 5, batch 30000, loss[loss=0.2104, simple_loss=0.2914, pruned_loss=0.06474, over 21298.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3119, pruned_loss=0.08194, over 4281235.02 frames. ], batch size: 159, lr: 5.92e-03, grad_scale: 32.0
2023-06-22 06:16:33,841 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-22 06:17:02,268 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.5040, 3.9741, 3.5829, 2.2229], device='cuda:0')
2023-06-22 06:17:18,835 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2496, simple_loss=0.3465, pruned_loss=0.07629, over 1796401.00 frames.
2023-06-22 06:17:18,836 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB
2023-06-22 06:17:54,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.727e+02 3.282e+02 4.075e+02 7.165e+02, threshold=6.565e+02, percent-clipped=2.0
2023-06-22 06:17:55,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0
2023-06-22 06:18:09,624 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-152000.pt
2023-06-22 06:19:08,619 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=22.5
2023-06-22 06:20:02,386 INFO [train.py:996] (0/4) Epoch 5, batch 30050, loss[loss=0.2273, simple_loss=0.3248, pruned_loss=0.06495, over 21721.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3169, pruned_loss=0.07963, over 4281696.09 frames. ], batch size: 298, lr: 5.92e-03, grad_scale: 32.0
2023-06-22 06:20:30,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=912234.0, ans=10.0
2023-06-22 06:21:50,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=912414.0, ans=0.1
2023-06-22 06:21:52,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=912414.0, ans=0.125
2023-06-22 06:22:04,891 INFO [train.py:996] (0/4) Epoch 5, batch 30100, loss[loss=0.2107, simple_loss=0.2735, pruned_loss=0.07396, over 21767.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3164, pruned_loss=0.07971, over 4285432.86 frames. ], batch size: 118, lr: 5.92e-03, grad_scale: 32.0
2023-06-22 06:22:09,152 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.78 vs. limit=10.0
2023-06-22 06:22:13,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=912474.0, ans=0.0
2023-06-22 06:22:21,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.782e+02 3.150e+02 3.648e+02 6.507e+02, threshold=6.300e+02, percent-clipped=0.0
2023-06-22 06:22:29,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=912534.0, ans=0.2
2023-06-22 06:23:55,574 INFO [train.py:996] (0/4) Epoch 5, batch 30150, loss[loss=0.221, simple_loss=0.2925, pruned_loss=0.07472, over 21751.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.313, pruned_loss=0.08159, over 4283717.30 frames. ], batch size: 282, lr: 5.92e-03, grad_scale: 32.0
2023-06-22 06:24:42,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=912834.0, ans=15.0
2023-06-22 06:26:30,437 INFO [train.py:996] (0/4) Epoch 5, batch 30200, loss[loss=0.2131, simple_loss=0.3128, pruned_loss=0.05666, over 21757.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3137, pruned_loss=0.0809, over 4278992.51 frames. ], batch size: 282, lr: 5.92e-03, grad_scale: 32.0
2023-06-22 06:26:59,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.559e+02 2.949e+02 3.535e+02 6.486e+02, threshold=5.897e+02, percent-clipped=1.0
2023-06-22 06:27:26,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5
2023-06-22 06:28:00,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=913194.0, ans=0.0
2023-06-22 06:28:00,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=913194.0, ans=0.125
2023-06-22 06:28:11,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=913254.0, ans=0.125
2023-06-22 06:28:33,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=913314.0, ans=0.1
2023-06-22 06:28:49,075 INFO [train.py:996] (0/4) Epoch 5, batch 30250, loss[loss=0.2815, simple_loss=0.3833, pruned_loss=0.08984, over 21724.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3204, pruned_loss=0.08241, over 4280000.10 frames. ], batch size: 298, lr: 5.92e-03, grad_scale: 32.0
2023-06-22 06:29:15,713 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5
2023-06-22 06:29:38,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=913434.0, ans=0.0
2023-06-22 06:30:51,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=913614.0, ans=0.125
2023-06-22 06:31:21,057 INFO [train.py:996] (0/4) Epoch 5, batch 30300, loss[loss=0.2128, simple_loss=0.2714, pruned_loss=0.0771, over 21240.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3172, pruned_loss=0.08216, over 4283147.93 frames. ], batch size: 144, lr: 5.91e-03, grad_scale: 32.0
2023-06-22 06:31:59,448 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.693e+02 3.085e+02 3.904e+02 6.808e+02, threshold=6.171e+02, percent-clipped=2.0
2023-06-22 06:33:44,925 INFO [train.py:996] (0/4) Epoch 5, batch 30350, loss[loss=0.3212, simple_loss=0.4048, pruned_loss=0.1188, over 21501.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3174, pruned_loss=0.08327, over 4280219.70 frames. ], batch size: 473, lr: 5.91e-03, grad_scale: 32.0
2023-06-22 06:34:13,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0
2023-06-22 06:34:15,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=914034.0, ans=0.125
2023-06-22 06:36:02,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=914214.0, ans=0.05
2023-06-22 06:36:44,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0
2023-06-22 06:36:45,046 INFO [train.py:996] (0/4) Epoch 5, batch 30400, loss[loss=0.2195, simple_loss=0.2673, pruned_loss=0.08587, over 20228.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3122, pruned_loss=0.08192, over 4270787.65 frames. ], batch size: 703, lr: 5.91e-03, grad_scale: 32.0
2023-06-22 06:37:53,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.876e+02 3.400e+02 4.558e+02 7.300e+02, threshold=6.801e+02, percent-clipped=4.0
2023-06-22 06:38:03,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=914334.0, ans=0.125
2023-06-22 06:38:04,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=914334.0, ans=0.0
2023-06-22 06:38:46,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=914394.0, ans=0.125
2023-06-22 06:39:32,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=914454.0, ans=0.07
2023-06-22 06:40:11,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=914454.0, ans=0.0
2023-06-22 06:41:03,385 INFO [train.py:996] (0/4) Epoch 5, batch 30450, loss[loss=0.3041, simple_loss=0.4114, pruned_loss=0.09842, over 19827.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3132, pruned_loss=0.08202, over 4209318.88 frames. ], batch size: 702, lr: 5.91e-03, grad_scale: 16.0
2023-06-22 06:42:29,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=914634.0, ans=0.0
2023-06-22 06:42:32,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=914634.0, ans=0.125
2023-06-22 06:44:28,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=914754.0, ans=0.125
2023-06-22 06:44:30,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0
2023-06-22 06:44:41,058 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-5.pt
2023-06-22 06:47:12,974 INFO [train.py:996] (0/4) Epoch 6, batch 0, loss[loss=0.219, simple_loss=0.2752, pruned_loss=0.08142, over 21294.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2752, pruned_loss=0.08142, over 21294.00 frames. ], batch size: 177, lr: 5.35e-03, grad_scale: 32.0
2023-06-22 06:47:12,976 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-22 06:48:06,640 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2383, simple_loss=0.345, pruned_loss=0.06584, over 1796401.00 frames.
2023-06-22 06:48:06,644 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB
2023-06-22 06:48:18,962 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 06:48:20,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=914838.0, ans=0.125
2023-06-22 06:48:52,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.955e+02 5.006e+02 6.285e+02 8.348e+02 2.118e+03, threshold=1.257e+03, percent-clipped=42.0
2023-06-22 06:48:57,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=914958.0, ans=0.125
2023-06-22 06:49:24,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=915018.0, ans=0.2
2023-06-22 06:50:15,984 INFO [train.py:996] (0/4) Epoch 6, batch 50, loss[loss=0.3236, simple_loss=0.398, pruned_loss=0.1246, over 21489.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3111, pruned_loss=0.07863, over 962993.18 frames. ], batch size: 471, lr: 5.35e-03, grad_scale: 16.0
2023-06-22 06:50:19,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=915138.0, ans=0.125
2023-06-22 06:50:22,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=915138.0, ans=0.0
2023-06-22 06:50:25,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=915138.0, ans=0.1
2023-06-22 06:50:52,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=915198.0, ans=0.125
2023-06-22 06:51:05,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0
2023-06-22 06:51:51,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=15.0
2023-06-22 06:51:51,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=915378.0, ans=0.0
2023-06-22 06:51:51,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=915378.0, ans=0.0
2023-06-22 06:52:24,648 INFO [train.py:996] (0/4) Epoch 6, batch 100, loss[loss=0.2489, simple_loss=0.3547, pruned_loss=0.07157, over 21443.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3233, pruned_loss=0.08037, over 1686667.08 frames. ], batch size: 211, lr: 5.34e-03, grad_scale: 16.0
2023-06-22 06:52:29,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=915438.0, ans=0.0
2023-06-22 06:52:38,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=915498.0, ans=0.125
2023-06-22 06:52:38,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=915498.0, ans=0.125
2023-06-22 06:53:01,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=915498.0, ans=0.125
2023-06-22 06:53:10,362 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.761e+02 2.323e+02 2.634e+02 3.053e+02 4.648e+02, threshold=5.268e+02, percent-clipped=0.0
2023-06-22 06:53:14,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=915558.0, ans=0.125
2023-06-22 06:54:10,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=1.99 vs. limit=12.0
2023-06-22 06:54:19,698 INFO [train.py:996] (0/4) Epoch 6, batch 150, loss[loss=0.2503, simple_loss=0.3343, pruned_loss=0.08309, over 21241.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3305, pruned_loss=0.0816, over 2267920.37 frames. ], batch size: 143, lr: 5.34e-03, grad_scale: 16.0
2023-06-22 06:55:55,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=915918.0, ans=10.0
2023-06-22 06:55:59,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0
2023-06-22 06:56:43,930 INFO [train.py:996] (0/4) Epoch 6, batch 200, loss[loss=0.3153, simple_loss=0.37, pruned_loss=0.1303, over 21387.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3276, pruned_loss=0.08195, over 2700422.84 frames. ], batch size: 471, lr: 5.34e-03, grad_scale: 16.0
2023-06-22 06:56:44,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.03 vs. limit=10.0
2023-06-22 06:57:42,671 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.601e+02 2.986e+02 3.624e+02 6.597e+02, threshold=5.972e+02, percent-clipped=3.0
2023-06-22 06:58:43,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=916278.0, ans=0.125
2023-06-22 06:58:53,833 INFO [train.py:996] (0/4) Epoch 6, batch 250, loss[loss=0.2483, simple_loss=0.327, pruned_loss=0.08481, over 21594.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3248, pruned_loss=0.0816, over 3051094.56 frames. ], batch size: 389, lr: 5.34e-03, grad_scale: 16.0
2023-06-22 06:59:42,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=916398.0, ans=0.125
2023-06-22 07:00:21,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=916518.0, ans=0.2
2023-06-22 07:00:21,973 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.07 vs. limit=5.0
2023-06-22 07:00:43,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=916518.0, ans=0.0
2023-06-22 07:01:15,606 INFO [train.py:996] (0/4) Epoch 6, batch 300, loss[loss=0.181, simple_loss=0.2475, pruned_loss=0.05721, over 21369.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3201, pruned_loss=0.08134, over 3317940.67 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 16.0
2023-06-22 07:01:16,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=916638.0, ans=0.0
2023-06-22 07:01:18,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0
2023-06-22 07:01:38,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0
2023-06-22 07:01:42,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=916638.0, ans=0.2
2023-06-22 07:02:08,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=916698.0, ans=0.125
2023-06-22 07:02:11,258 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.629e+02 3.080e+02 3.512e+02 4.991e+02, threshold=6.161e+02, percent-clipped=0.0
2023-06-22 07:02:22,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=916758.0, ans=0.0
2023-06-22 07:02:54,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=916878.0, ans=0.1
2023-06-22 07:03:02,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=916878.0, ans=0.125
2023-06-22 07:03:36,292 INFO [train.py:996] (0/4) Epoch 6, batch 350, loss[loss=0.22, simple_loss=0.316, pruned_loss=0.06204, over 21740.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3137, pruned_loss=0.07982, over 3536830.02 frames. ], batch size: 124, lr: 5.34e-03, grad_scale: 16.0
2023-06-22 07:04:27,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=917058.0, ans=0.0
2023-06-22 07:04:36,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=917058.0, ans=0.125
2023-06-22 07:04:39,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=917118.0, ans=0.1
2023-06-22 07:05:07,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=917118.0, ans=0.05
2023-06-22 07:05:07,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=917118.0, ans=0.125
2023-06-22 07:05:42,665 INFO [train.py:996] (0/4) Epoch 6, batch 400, loss[loss=0.235, simple_loss=0.3536, pruned_loss=0.0582, over 20803.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3078, pruned_loss=0.07746, over 3698708.31 frames. ], batch size: 608, lr: 5.34e-03, grad_scale: 32.0
2023-06-22 07:06:37,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=917298.0, ans=0.0
2023-06-22 07:06:41,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.619e+02 3.011e+02 3.416e+02 5.139e+02, threshold=6.021e+02, percent-clipped=0.0
2023-06-22 07:07:55,473 INFO [train.py:996] (0/4) Epoch 6, batch 450, loss[loss=0.1691, simple_loss=0.2493, pruned_loss=0.04442, over 21299.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3027, pruned_loss=0.07635, over 3831426.26 frames. ], batch size: 176, lr: 5.34e-03, grad_scale: 32.0
2023-06-22 07:07:55,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=917538.0, ans=0.125
2023-06-22 07:08:09,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=917538.0, ans=0.125
2023-06-22 07:08:09,293 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 07:08:54,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=917598.0, ans=0.0
2023-06-22 07:09:26,656 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0
2023-06-22 07:09:35,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0
2023-06-22 07:10:16,303 INFO [train.py:996] (0/4) Epoch 6, batch 500, loss[loss=0.2056, simple_loss=0.2667, pruned_loss=0.07223, over 21703.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3045, pruned_loss=0.07567, over 3938991.12 frames. ], batch size: 112, lr: 5.34e-03, grad_scale: 32.0
2023-06-22 07:10:17,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5
2023-06-22 07:11:01,586 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.514e+02 2.945e+02 3.485e+02 5.759e+02, threshold=5.890e+02, percent-clipped=0.0
2023-06-22 07:11:23,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=918018.0, ans=0.125
2023-06-22 07:11:26,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.41 vs. limit=15.0
2023-06-22 07:11:50,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=918018.0, ans=0.125
2023-06-22 07:12:09,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=918078.0, ans=0.1
2023-06-22 07:12:28,437 INFO [train.py:996] (0/4) Epoch 6, batch 550, loss[loss=0.2181, simple_loss=0.2905, pruned_loss=0.07282, over 19935.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.307, pruned_loss=0.0749, over 4021145.86 frames. ], batch size: 704, lr: 5.34e-03, grad_scale: 32.0
2023-06-22 07:12:41,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=918138.0, ans=0.0
2023-06-22 07:13:31,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=918258.0, ans=0.125
2023-06-22 07:14:38,195 INFO [train.py:996] (0/4) Epoch 6, batch 600, loss[loss=0.2301, simple_loss=0.3027, pruned_loss=0.07874, over 21366.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3074, pruned_loss=0.07431, over 4076144.14 frames. ], batch size: 176, lr: 5.34e-03, grad_scale: 32.0
2023-06-22 07:15:22,304 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.705e+02 3.324e+02 3.965e+02 6.330e+02, threshold=6.647e+02, percent-clipped=3.0
2023-06-22 07:16:19,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=918618.0, ans=0.05
2023-06-22 07:16:42,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=918678.0, ans=0.0
2023-06-22 07:16:49,482 INFO [train.py:996] (0/4) Epoch 6, batch 650, loss[loss=0.2435, simple_loss=0.3083, pruned_loss=0.08931, over 21909.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3078, pruned_loss=0.07441, over 4128943.77 frames.
], batch size: 414, lr: 5.34e-03, grad_scale: 32.0 2023-06-22 07:17:03,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=918738.0, ans=0.2 2023-06-22 07:17:19,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=918798.0, ans=0.125 2023-06-22 07:17:22,679 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:17:57,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=918858.0, ans=0.0 2023-06-22 07:18:33,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=918978.0, ans=0.035 2023-06-22 07:18:50,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=918978.0, ans=0.1 2023-06-22 07:18:53,911 INFO [train.py:996] (0/4) Epoch 6, batch 700, loss[loss=0.2417, simple_loss=0.3746, pruned_loss=0.05441, over 19743.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3093, pruned_loss=0.07584, over 4160546.27 frames. ], batch size: 703, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:19:46,551 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.473e+02 2.771e+02 3.367e+02 4.695e+02, threshold=5.542e+02, percent-clipped=0.0 2023-06-22 07:20:06,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=919158.0, ans=0.125 2023-06-22 07:20:22,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=919218.0, ans=0.2 2023-06-22 07:20:27,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=919218.0, ans=0.0 2023-06-22 07:21:03,747 INFO [train.py:996] (0/4) Epoch 6, batch 750, loss[loss=0.2448, simple_loss=0.3257, pruned_loss=0.08194, over 21651.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3086, pruned_loss=0.07699, over 4188235.38 frames. ], batch size: 230, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:21:06,107 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-22 07:21:36,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=919398.0, ans=0.125 2023-06-22 07:22:22,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.27 vs. limit=12.0 2023-06-22 07:22:27,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=919518.0, ans=0.125 2023-06-22 07:22:38,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=919578.0, ans=0.125 2023-06-22 07:23:09,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=919638.0, ans=0.125 2023-06-22 07:23:10,545 INFO [train.py:996] (0/4) Epoch 6, batch 800, loss[loss=0.2141, simple_loss=0.2874, pruned_loss=0.07041, over 21707.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3052, pruned_loss=0.07708, over 4203972.19 frames. 
], batch size: 298, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:23:33,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=919638.0, ans=0.2 2023-06-22 07:24:07,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=919758.0, ans=0.0 2023-06-22 07:24:08,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.537e+02 3.024e+02 3.645e+02 6.511e+02, threshold=6.048e+02, percent-clipped=3.0 2023-06-22 07:24:58,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=919818.0, ans=0.125 2023-06-22 07:25:15,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=919878.0, ans=0.0 2023-06-22 07:25:23,368 INFO [train.py:996] (0/4) Epoch 6, batch 850, loss[loss=0.2183, simple_loss=0.3467, pruned_loss=0.04491, over 20798.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3032, pruned_loss=0.07641, over 4217470.01 frames. ], batch size: 608, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:25:47,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=919938.0, ans=0.125 2023-06-22 07:26:00,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=919998.0, ans=0.0 2023-06-22 07:26:41,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=920058.0, ans=0.125 2023-06-22 07:26:45,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=920118.0, ans=0.1 2023-06-22 07:26:48,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=920118.0, ans=0.125 2023-06-22 07:27:43,055 INFO [train.py:996] (0/4) Epoch 6, batch 900, loss[loss=0.2399, simple_loss=0.3153, pruned_loss=0.08224, over 21851.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.301, pruned_loss=0.07584, over 4232570.79 frames. ], batch size: 371, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:28:20,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.13 vs. limit=15.0 2023-06-22 07:28:24,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.591e+02 2.994e+02 3.530e+02 5.655e+02, threshold=5.988e+02, percent-clipped=0.0 2023-06-22 07:29:48,822 INFO [train.py:996] (0/4) Epoch 6, batch 950, loss[loss=0.1718, simple_loss=0.25, pruned_loss=0.04679, over 21290.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2985, pruned_loss=0.07563, over 4249526.52 frames. ], batch size: 176, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:30:33,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=920598.0, ans=0.125 2023-06-22 07:30:33,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.54 vs. 
limit=15.0 2023-06-22 07:30:48,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=920658.0, ans=0.0 2023-06-22 07:31:33,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=920718.0, ans=0.0 2023-06-22 07:32:08,551 INFO [train.py:996] (0/4) Epoch 6, batch 1000, loss[loss=0.197, simple_loss=0.2546, pruned_loss=0.06968, over 21193.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2976, pruned_loss=0.07563, over 4260156.68 frames. ], batch size: 548, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:32:19,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=920838.0, ans=0.125 2023-06-22 07:32:47,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=920898.0, ans=0.0 2023-06-22 07:33:01,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.550e+02 2.798e+02 3.235e+02 6.072e+02, threshold=5.596e+02, percent-clipped=1.0 2023-06-22 07:33:08,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-22 07:34:12,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=921078.0, ans=0.0 2023-06-22 07:34:20,072 INFO [train.py:996] (0/4) Epoch 6, batch 1050, loss[loss=0.2382, simple_loss=0.3286, pruned_loss=0.07385, over 21793.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2987, pruned_loss=0.07575, over 4271689.27 frames. ], batch size: 371, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:34:32,404 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:34:41,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-22 07:35:42,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=921318.0, ans=0.0 2023-06-22 07:35:45,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=921318.0, ans=0.5 2023-06-22 07:36:26,860 INFO [train.py:996] (0/4) Epoch 6, batch 1100, loss[loss=0.1871, simple_loss=0.2748, pruned_loss=0.04969, over 21639.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2981, pruned_loss=0.07453, over 4270801.60 frames. ], batch size: 263, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:37:08,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=921498.0, ans=0.125 2023-06-22 07:37:18,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.723e+02 3.096e+02 3.940e+02 7.393e+02, threshold=6.192e+02, percent-clipped=9.0 2023-06-22 07:38:35,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=921678.0, ans=0.125 2023-06-22 07:38:44,052 INFO [train.py:996] (0/4) Epoch 6, batch 1150, loss[loss=0.2021, simple_loss=0.268, pruned_loss=0.06814, over 16664.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2996, pruned_loss=0.07465, over 4275063.01 frames. 
], batch size: 60, lr: 5.33e-03, grad_scale: 16.0 2023-06-22 07:40:27,576 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:40:50,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=921978.0, ans=0.0 2023-06-22 07:40:55,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=921978.0, ans=0.125 2023-06-22 07:41:01,465 INFO [train.py:996] (0/4) Epoch 6, batch 1200, loss[loss=0.2059, simple_loss=0.2645, pruned_loss=0.07369, over 21188.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3009, pruned_loss=0.0757, over 4278730.80 frames. ], batch size: 608, lr: 5.33e-03, grad_scale: 32.0 2023-06-22 07:41:01,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=922038.0, ans=0.125 2023-06-22 07:42:14,467 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 2.527e+02 2.979e+02 3.746e+02 6.173e+02, threshold=5.958e+02, percent-clipped=0.0 2023-06-22 07:43:04,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=922278.0, ans=0.2 2023-06-22 07:43:23,983 INFO [train.py:996] (0/4) Epoch 6, batch 1250, loss[loss=0.2336, simple_loss=0.3022, pruned_loss=0.08246, over 21841.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3042, pruned_loss=0.07762, over 4276588.38 frames. ], batch size: 107, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:43:46,153 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:44:33,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=922458.0, ans=0.125 2023-06-22 07:44:45,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=922458.0, ans=0.125 2023-06-22 07:44:55,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=922458.0, ans=0.1 2023-06-22 07:45:14,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-22 07:45:15,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=922518.0, ans=0.125 2023-06-22 07:45:34,446 INFO [train.py:996] (0/4) Epoch 6, batch 1300, loss[loss=0.2018, simple_loss=0.2732, pruned_loss=0.06523, over 21435.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.306, pruned_loss=0.07812, over 4277904.17 frames. ], batch size: 131, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:45:54,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. 
limit=15.0 2023-06-22 07:46:08,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=922698.0, ans=0.125 2023-06-22 07:46:17,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=922698.0, ans=0.125 2023-06-22 07:46:54,076 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.638e+02 3.096e+02 3.825e+02 7.395e+02, threshold=6.191e+02, percent-clipped=3.0 2023-06-22 07:47:11,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=922758.0, ans=0.0 2023-06-22 07:47:59,659 INFO [train.py:996] (0/4) Epoch 6, batch 1350, loss[loss=0.2163, simple_loss=0.3027, pruned_loss=0.06499, over 21636.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3068, pruned_loss=0.07782, over 4274083.09 frames. ], batch size: 230, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:48:23,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=922938.0, ans=0.2 2023-06-22 07:48:33,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=922998.0, ans=0.0 2023-06-22 07:49:06,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=923058.0, ans=0.0 2023-06-22 07:49:34,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0 2023-06-22 07:49:41,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=923118.0, ans=0.125 2023-06-22 07:49:45,835 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-22 07:50:04,093 INFO [train.py:996] (0/4) Epoch 6, batch 1400, loss[loss=0.2147, simple_loss=0.2843, pruned_loss=0.07255, over 21800.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.305, pruned_loss=0.07845, over 4279696.86 frames. 
], batch size: 98, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:50:22,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=923238.0, ans=0.125 2023-06-22 07:50:24,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=923238.0, ans=15.0 2023-06-22 07:50:35,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=923298.0, ans=0.0 2023-06-22 07:50:48,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=923298.0, ans=0.125 2023-06-22 07:50:50,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=923298.0, ans=0.2 2023-06-22 07:51:09,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=923358.0, ans=0.125 2023-06-22 07:51:10,362 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.497e+02 2.735e+02 3.069e+02 5.769e+02, threshold=5.470e+02, percent-clipped=0.0 2023-06-22 07:51:34,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=22.5 2023-06-22 07:52:19,685 INFO [train.py:996] (0/4) Epoch 6, batch 1450, loss[loss=0.2182, simple_loss=0.2772, pruned_loss=0.07961, over 21650.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3057, pruned_loss=0.07937, over 4278258.51 frames. ], batch size: 415, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:52:55,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=923598.0, ans=0.0 2023-06-22 07:53:23,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=923658.0, ans=0.0 2023-06-22 07:54:36,508 INFO [train.py:996] (0/4) Epoch 6, batch 1500, loss[loss=0.235, simple_loss=0.3029, pruned_loss=0.08355, over 21336.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3081, pruned_loss=0.08016, over 4279661.87 frames. ], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:55:21,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=923898.0, ans=0.125 2023-06-22 07:55:31,682 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.593e+02 2.969e+02 3.439e+02 4.928e+02, threshold=5.939e+02, percent-clipped=0.0 2023-06-22 07:55:41,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=923958.0, ans=0.1 2023-06-22 07:55:59,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=924018.0, ans=0.125 2023-06-22 07:55:59,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=924018.0, ans=0.125 2023-06-22 07:56:38,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=924138.0, ans=0.2 2023-06-22 07:56:39,027 INFO [train.py:996] (0/4) Epoch 6, batch 1550, loss[loss=0.2124, simple_loss=0.2874, pruned_loss=0.06869, over 21494.00 frames. 
], tot_loss[loss=0.232, simple_loss=0.3063, pruned_loss=0.07884, over 4279548.44 frames. ], batch size: 131, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:56:39,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=924138.0, ans=0.125 2023-06-22 07:58:58,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-22 07:59:11,422 INFO [train.py:996] (0/4) Epoch 6, batch 1600, loss[loss=0.2244, simple_loss=0.2952, pruned_loss=0.07679, over 21801.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3051, pruned_loss=0.07906, over 4275914.44 frames. ], batch size: 316, lr: 5.32e-03, grad_scale: 32.0 2023-06-22 07:59:32,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=924438.0, ans=0.05 2023-06-22 07:59:45,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=924498.0, ans=0.0 2023-06-22 08:00:09,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.447e+02 2.889e+02 3.506e+02 5.752e+02, threshold=5.778e+02, percent-clipped=0.0 2023-06-22 08:01:09,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=924678.0, ans=0.2 2023-06-22 08:01:18,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=924678.0, ans=0.125 2023-06-22 08:01:23,864 INFO [train.py:996] (0/4) Epoch 6, batch 1650, loss[loss=0.2391, simple_loss=0.3048, pruned_loss=0.08673, over 21746.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3041, pruned_loss=0.07892, over 4274710.27 frames. ], batch size: 389, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:01:46,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=924738.0, ans=0.0 2023-06-22 08:03:14,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0 2023-06-22 08:03:42,429 INFO [train.py:996] (0/4) Epoch 6, batch 1700, loss[loss=0.2454, simple_loss=0.3197, pruned_loss=0.08557, over 21694.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3073, pruned_loss=0.0797, over 4278557.51 frames. ], batch size: 351, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:04:36,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=925098.0, ans=0.125 2023-06-22 08:04:42,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.575e+02 2.876e+02 3.379e+02 6.371e+02, threshold=5.752e+02, percent-clipped=1.0 2023-06-22 08:06:14,620 INFO [train.py:996] (0/4) Epoch 6, batch 1750, loss[loss=0.1857, simple_loss=0.2756, pruned_loss=0.04789, over 21710.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3062, pruned_loss=0.0775, over 4275355.88 frames. 
], batch size: 332, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:07:01,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=925398.0, ans=0.125 2023-06-22 08:07:09,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=925458.0, ans=0.07 2023-06-22 08:07:23,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=925458.0, ans=0.125 2023-06-22 08:08:16,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-22 08:08:17,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=925578.0, ans=0.2 2023-06-22 08:08:31,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=925578.0, ans=0.035 2023-06-22 08:08:43,944 INFO [train.py:996] (0/4) Epoch 6, batch 1800, loss[loss=0.2128, simple_loss=0.2978, pruned_loss=0.06387, over 21290.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3049, pruned_loss=0.07501, over 4274737.51 frames. ], batch size: 176, lr: 5.32e-03, grad_scale: 16.0 2023-06-22 08:09:38,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.513e+02 3.137e+02 3.734e+02 6.683e+02, threshold=6.274e+02, percent-clipped=3.0 2023-06-22 08:10:15,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-22 08:10:36,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=925878.0, ans=0.0 2023-06-22 08:10:37,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=925878.0, ans=0.125 2023-06-22 08:10:39,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=925878.0, ans=0.1 2023-06-22 08:10:48,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=925878.0, ans=0.125 2023-06-22 08:10:58,597 INFO [train.py:996] (0/4) Epoch 6, batch 1850, loss[loss=0.242, simple_loss=0.3151, pruned_loss=0.08443, over 21512.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.305, pruned_loss=0.07295, over 4274942.94 frames. ], batch size: 441, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:11:04,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-22 08:11:15,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=925998.0, ans=0.2 2023-06-22 08:11:16,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. 
limit=15.0 2023-06-22 08:12:09,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=926058.0, ans=0.125 2023-06-22 08:12:12,733 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:12:42,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=926178.0, ans=0.125 2023-06-22 08:13:07,276 INFO [train.py:996] (0/4) Epoch 6, batch 1900, loss[loss=0.2075, simple_loss=0.282, pruned_loss=0.06649, over 21763.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3059, pruned_loss=0.0745, over 4276696.34 frames. ], batch size: 112, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:13:49,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=926298.0, ans=0.125 2023-06-22 08:13:59,633 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 2.371e+02 2.637e+02 3.288e+02 5.530e+02, threshold=5.274e+02, percent-clipped=0.0 2023-06-22 08:14:54,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=926478.0, ans=0.1 2023-06-22 08:15:08,554 INFO [train.py:996] (0/4) Epoch 6, batch 1950, loss[loss=0.2064, simple_loss=0.2646, pruned_loss=0.0741, over 21582.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3023, pruned_loss=0.0735, over 4253215.13 frames. ], batch size: 415, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:17:07,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=926778.0, ans=0.0 2023-06-22 08:17:25,660 INFO [train.py:996] (0/4) Epoch 6, batch 2000, loss[loss=0.1932, simple_loss=0.2639, pruned_loss=0.06123, over 21644.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2966, pruned_loss=0.07138, over 4264166.08 frames. ], batch size: 247, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:17:42,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=926838.0, ans=0.125 2023-06-22 08:18:03,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-22 08:18:04,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=926898.0, ans=0.07 2023-06-22 08:18:41,773 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.517e+02 3.003e+02 3.680e+02 6.988e+02, threshold=6.006e+02, percent-clipped=2.0 2023-06-22 08:18:42,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=926958.0, ans=0.1 2023-06-22 08:19:10,528 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:19:17,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=927018.0, ans=0.2 2023-06-22 08:19:21,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-22 08:19:41,266 INFO [train.py:996] (0/4) Epoch 6, batch 2050, loss[loss=0.1929, simple_loss=0.2658, pruned_loss=0.05997, over 21630.00 frames. 
], tot_loss[loss=0.2206, simple_loss=0.2984, pruned_loss=0.07146, over 4267307.70 frames. ], batch size: 298, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:19:44,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=927138.0, ans=0.0 2023-06-22 08:20:12,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=927138.0, ans=0.1 2023-06-22 08:20:13,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=927138.0, ans=15.0 2023-06-22 08:20:33,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=927198.0, ans=0.0 2023-06-22 08:20:44,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=927258.0, ans=0.95 2023-06-22 08:21:26,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=15.0 2023-06-22 08:21:53,269 INFO [train.py:996] (0/4) Epoch 6, batch 2100, loss[loss=0.2327, simple_loss=0.3388, pruned_loss=0.06327, over 21171.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2993, pruned_loss=0.07341, over 4272964.19 frames. ], batch size: 548, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:22:57,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=927498.0, ans=0.0 2023-06-22 08:23:02,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 2.488e+02 2.797e+02 3.181e+02 4.805e+02, threshold=5.593e+02, percent-clipped=0.0 2023-06-22 08:23:40,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=927678.0, ans=0.1 2023-06-22 08:23:55,929 INFO [train.py:996] (0/4) Epoch 6, batch 2150, loss[loss=0.2394, simple_loss=0.298, pruned_loss=0.09037, over 21598.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3013, pruned_loss=0.07542, over 4279208.06 frames. ], batch size: 441, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:24:02,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=927738.0, ans=0.0 2023-06-22 08:24:08,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=927738.0, ans=0.0 2023-06-22 08:24:21,181 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:25:45,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=927918.0, ans=0.125 2023-06-22 08:26:29,708 INFO [train.py:996] (0/4) Epoch 6, batch 2200, loss[loss=0.2575, simple_loss=0.3372, pruned_loss=0.08892, over 21459.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3045, pruned_loss=0.07681, over 4279287.19 frames. 
], batch size: 211, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:26:33,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=928038.0, ans=0.035 2023-06-22 08:26:34,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=928038.0, ans=0.125 2023-06-22 08:26:58,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=22.5 2023-06-22 08:27:19,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.524e+02 2.938e+02 3.360e+02 6.065e+02, threshold=5.877e+02, percent-clipped=1.0 2023-06-22 08:27:48,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.91 vs. limit=15.0 2023-06-22 08:28:11,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=928278.0, ans=0.125 2023-06-22 08:28:32,322 INFO [train.py:996] (0/4) Epoch 6, batch 2250, loss[loss=0.1772, simple_loss=0.2444, pruned_loss=0.05497, over 21403.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3029, pruned_loss=0.07536, over 4281440.19 frames. ], batch size: 131, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:30:19,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=928578.0, ans=0.2 2023-06-22 08:30:23,221 INFO [train.py:996] (0/4) Epoch 6, batch 2300, loss[loss=0.2171, simple_loss=0.2801, pruned_loss=0.07705, over 21818.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2986, pruned_loss=0.07443, over 4278731.14 frames. ], batch size: 352, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:31:28,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.418e+02 2.777e+02 3.405e+02 6.239e+02, threshold=5.554e+02, percent-clipped=2.0 2023-06-22 08:32:04,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=928818.0, ans=0.125 2023-06-22 08:32:09,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=928818.0, ans=0.0 2023-06-22 08:32:14,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=928818.0, ans=0.0 2023-06-22 08:32:15,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=928878.0, ans=0.125 2023-06-22 08:32:30,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=928938.0, ans=0.1 2023-06-22 08:32:31,782 INFO [train.py:996] (0/4) Epoch 6, batch 2350, loss[loss=0.2486, simple_loss=0.3057, pruned_loss=0.09582, over 21254.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2951, pruned_loss=0.0745, over 4280845.57 frames. 
], batch size: 159, lr: 5.31e-03, grad_scale: 16.0 2023-06-22 08:33:28,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=928998.0, ans=0.05 2023-06-22 08:34:14,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=929118.0, ans=0.025 2023-06-22 08:34:49,993 INFO [train.py:996] (0/4) Epoch 6, batch 2400, loss[loss=0.2577, simple_loss=0.3312, pruned_loss=0.09208, over 21718.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3004, pruned_loss=0.07674, over 4278924.52 frames. ], batch size: 332, lr: 5.31e-03, grad_scale: 32.0 2023-06-22 08:35:18,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=12.0 2023-06-22 08:36:00,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=22.5 2023-06-22 08:36:02,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.593e+02 2.927e+02 3.472e+02 6.319e+02, threshold=5.855e+02, percent-clipped=5.0 2023-06-22 08:37:11,337 INFO [train.py:996] (0/4) Epoch 6, batch 2450, loss[loss=0.2198, simple_loss=0.2906, pruned_loss=0.07444, over 15213.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3045, pruned_loss=0.07968, over 4269949.82 frames. ], batch size: 60, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:38:27,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=929718.0, ans=0.0 2023-06-22 08:38:56,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=929778.0, ans=0.0 2023-06-22 08:39:13,826 INFO [train.py:996] (0/4) Epoch 6, batch 2500, loss[loss=0.2594, simple_loss=0.2968, pruned_loss=0.1111, over 21366.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3041, pruned_loss=0.07917, over 4270375.93 frames. ], batch size: 508, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:39:45,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=929898.0, ans=0.0 2023-06-22 08:39:58,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=929898.0, ans=0.125 2023-06-22 08:40:23,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.580e+02 2.958e+02 3.428e+02 5.178e+02, threshold=5.916e+02, percent-clipped=0.0 2023-06-22 08:41:26,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=930078.0, ans=0.125 2023-06-22 08:41:28,661 INFO [train.py:996] (0/4) Epoch 6, batch 2550, loss[loss=0.2264, simple_loss=0.2961, pruned_loss=0.07831, over 15108.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3028, pruned_loss=0.07766, over 4262186.32 frames. ], batch size: 60, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:41:29,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=930138.0, ans=0.0 2023-06-22 08:42:30,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.35 vs. 
limit=15.0 2023-06-22 08:43:51,034 INFO [train.py:996] (0/4) Epoch 6, batch 2600, loss[loss=0.1975, simple_loss=0.2698, pruned_loss=0.06266, over 21587.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3034, pruned_loss=0.0776, over 4256822.09 frames. ], batch size: 263, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:44:42,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=930498.0, ans=0.05 2023-06-22 08:45:08,333 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.524e+02 2.947e+02 3.277e+02 5.096e+02, threshold=5.894e+02, percent-clipped=0.0 2023-06-22 08:45:34,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=930678.0, ans=0.04949747468305833 2023-06-22 08:45:37,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=930678.0, ans=0.0 2023-06-22 08:46:11,835 INFO [train.py:996] (0/4) Epoch 6, batch 2650, loss[loss=0.2253, simple_loss=0.2947, pruned_loss=0.07794, over 21392.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3052, pruned_loss=0.0782, over 4267935.05 frames. ], batch size: 143, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:46:50,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=930798.0, ans=0.125 2023-06-22 08:47:24,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=930858.0, ans=0.125 2023-06-22 08:47:36,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=930918.0, ans=0.125 2023-06-22 08:48:04,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=930978.0, ans=0.0 2023-06-22 08:48:18,342 INFO [train.py:996] (0/4) Epoch 6, batch 2700, loss[loss=0.2234, simple_loss=0.3081, pruned_loss=0.06938, over 21621.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3038, pruned_loss=0.07814, over 4269862.98 frames. ], batch size: 389, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:49:08,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=931098.0, ans=0.125 2023-06-22 08:49:35,955 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.647e+02 2.951e+02 3.422e+02 5.387e+02, threshold=5.902e+02, percent-clipped=0.0 2023-06-22 08:49:50,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=931218.0, ans=0.125 2023-06-22 08:50:25,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=931278.0, ans=0.0 2023-06-22 08:50:35,159 INFO [train.py:996] (0/4) Epoch 6, batch 2750, loss[loss=0.2551, simple_loss=0.3274, pruned_loss=0.09136, over 21741.00 frames. ], tot_loss[loss=0.23, simple_loss=0.303, pruned_loss=0.07852, over 4274731.89 frames. ], batch size: 112, lr: 5.30e-03, grad_scale: 16.0 2023-06-22 08:51:35,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0 2023-06-22 08:53:01,320 INFO [train.py:996] (0/4) Epoch 6, batch 2800, loss[loss=0.2362, simple_loss=0.3326, pruned_loss=0.06988, over 21404.00 frames. 
], tot_loss[loss=0.2334, simple_loss=0.308, pruned_loss=0.07936, over 4272173.38 frames. ], batch size: 211, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:53:57,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=931698.0, ans=0.0 2023-06-22 08:54:11,412 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.698e+02 3.040e+02 3.430e+02 5.325e+02, threshold=6.080e+02, percent-clipped=0.0 2023-06-22 08:54:16,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931758.0, ans=0.1 2023-06-22 08:54:17,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931758.0, ans=0.1 2023-06-22 08:54:50,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=931878.0, ans=0.0 2023-06-22 08:55:25,872 INFO [train.py:996] (0/4) Epoch 6, batch 2850, loss[loss=0.229, simple_loss=0.301, pruned_loss=0.07847, over 21728.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3096, pruned_loss=0.08123, over 4272960.23 frames. ], batch size: 298, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:55:30,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=931938.0, ans=0.0 2023-06-22 08:55:56,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-22 08:55:57,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=931998.0, ans=0.125 2023-06-22 08:56:09,422 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0 2023-06-22 08:56:10,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=931998.0, ans=0.125 2023-06-22 08:57:10,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932118.0, ans=0.1 2023-06-22 08:57:34,745 INFO [train.py:996] (0/4) Epoch 6, batch 2900, loss[loss=0.2522, simple_loss=0.308, pruned_loss=0.09817, over 21733.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3065, pruned_loss=0.08054, over 4281723.83 frames. ], batch size: 473, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 08:58:34,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=932358.0, ans=0.0 2023-06-22 08:58:38,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.660e+02 3.154e+02 3.986e+02 8.685e+02, threshold=6.308e+02, percent-clipped=2.0 2023-06-22 08:59:26,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=932478.0, ans=0.125 2023-06-22 08:59:51,156 INFO [train.py:996] (0/4) Epoch 6, batch 2950, loss[loss=0.2117, simple_loss=0.2886, pruned_loss=0.06745, over 21868.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3075, pruned_loss=0.08088, over 4290942.09 frames. 
], batch size: 118, lr: 5.30e-03, grad_scale: 32.0 2023-06-22 09:00:35,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=932598.0, ans=0.125 2023-06-22 09:00:36,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=932598.0, ans=0.0 2023-06-22 09:01:03,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=932658.0, ans=0.125 2023-06-22 09:02:10,245 INFO [train.py:996] (0/4) Epoch 6, batch 3000, loss[loss=0.2832, simple_loss=0.3501, pruned_loss=0.1081, over 21787.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3113, pruned_loss=0.08086, over 4295370.74 frames. ], batch size: 441, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:02:10,250 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 09:02:58,961 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6594, 2.0439, 3.3762, 2.0589], device='cuda:0') 2023-06-22 09:03:08,551 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2509, simple_loss=0.3421, pruned_loss=0.07991, over 1796401.00 frames. 2023-06-22 09:03:08,553 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB 2023-06-22 09:03:20,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932838.0, ans=0.1 2023-06-22 09:03:23,960 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-22 09:03:29,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=932898.0, ans=0.125 2023-06-22 09:03:51,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=932958.0, ans=0.0 2023-06-22 09:03:51,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=932958.0, ans=0.07 2023-06-22 09:03:57,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.599e+02 2.976e+02 3.379e+02 5.904e+02, threshold=5.951e+02, percent-clipped=0.0 2023-06-22 09:04:25,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=933018.0, ans=0.125 2023-06-22 09:05:08,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=933078.0, ans=0.125 2023-06-22 09:05:26,338 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:05:27,298 INFO [train.py:996] (0/4) Epoch 6, batch 3050, loss[loss=0.225, simple_loss=0.304, pruned_loss=0.07303, over 21725.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.311, pruned_loss=0.07881, over 4290141.45 frames. ], batch size: 414, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:05:29,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.42 vs. 
limit=10.0 2023-06-22 09:05:43,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=933198.0, ans=0.2 2023-06-22 09:06:41,193 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.04 vs. limit=22.5 2023-06-22 09:07:33,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0 2023-06-22 09:07:39,457 INFO [train.py:996] (0/4) Epoch 6, batch 3100, loss[loss=0.2261, simple_loss=0.3206, pruned_loss=0.06574, over 21588.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3108, pruned_loss=0.07836, over 4296587.46 frames. ], batch size: 389, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:08:44,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.677e+02 3.194e+02 3.911e+02 7.241e+02, threshold=6.388e+02, percent-clipped=3.0 2023-06-22 09:09:26,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=933618.0, ans=0.125 2023-06-22 09:10:02,626 INFO [train.py:996] (0/4) Epoch 6, batch 3150, loss[loss=0.266, simple_loss=0.338, pruned_loss=0.09699, over 21237.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3119, pruned_loss=0.07852, over 4291445.66 frames. ], batch size: 176, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:10:03,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=933738.0, ans=0.125 2023-06-22 09:10:09,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=933738.0, ans=0.025 2023-06-22 09:10:58,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-06-22 09:11:56,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=933918.0, ans=0.125 2023-06-22 09:11:56,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=15.0 2023-06-22 09:12:15,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-06-22 09:12:27,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2023-06-22 09:12:30,960 INFO [train.py:996] (0/4) Epoch 6, batch 3200, loss[loss=0.2311, simple_loss=0.3112, pruned_loss=0.07548, over 21712.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.313, pruned_loss=0.07885, over 4286846.65 frames. ], batch size: 298, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:13:53,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.451e+02 2.830e+02 3.191e+02 4.381e+02, threshold=5.660e+02, percent-clipped=0.0 2023-06-22 09:14:28,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=934278.0, ans=0.0 2023-06-22 09:14:44,658 INFO [train.py:996] (0/4) Epoch 6, batch 3250, loss[loss=0.1999, simple_loss=0.248, pruned_loss=0.07591, over 20782.00 frames. 
], tot_loss[loss=0.2368, simple_loss=0.3136, pruned_loss=0.07997, over 4283395.82 frames. ], batch size: 609, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:14:48,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.04 vs. limit=15.0 2023-06-22 09:14:49,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=934338.0, ans=0.125 2023-06-22 09:14:51,462 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-22 09:15:08,665 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:17:13,740 INFO [train.py:996] (0/4) Epoch 6, batch 3300, loss[loss=0.2042, simple_loss=0.2757, pruned_loss=0.06639, over 21628.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3071, pruned_loss=0.07957, over 4277944.81 frames. ], batch size: 282, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:17:14,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=934638.0, ans=0.125 2023-06-22 09:18:24,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.651e+02 2.919e+02 3.305e+02 7.329e+02, threshold=5.839e+02, percent-clipped=1.0 2023-06-22 09:19:19,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=934878.0, ans=0.0 2023-06-22 09:19:38,758 INFO [train.py:996] (0/4) Epoch 6, batch 3350, loss[loss=0.2498, simple_loss=0.3177, pruned_loss=0.09095, over 21371.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.309, pruned_loss=0.07955, over 4272173.69 frames. ], batch size: 176, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:20:07,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-22 09:20:15,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=934998.0, ans=0.5 2023-06-22 09:20:18,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=934998.0, ans=0.07 2023-06-22 09:20:31,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=934998.0, ans=0.125 2023-06-22 09:21:22,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-22 09:21:52,353 INFO [train.py:996] (0/4) Epoch 6, batch 3400, loss[loss=0.2197, simple_loss=0.2927, pruned_loss=0.07333, over 21536.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3089, pruned_loss=0.08007, over 4273751.05 frames. 
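The many "ScheduledFloat: name=..., batch_count=..., ans=..." entries track module hyperparameters (dropout probabilities, skip rates, balancer probs) that are functions of the global batch count rather than constants, which is why the same name reappears with different ans values as batch_count grows. A minimal sketch of such a schedule is below, assuming piecewise-linear interpolation between (batch_count, value) breakpoints with clamping at both ends; the real ScheduledFloat in icefall's scaling.py carries more machinery, so treat this as illustrative.

    from bisect import bisect_right
    from typing import List, Tuple

    def scheduled_float(batch_count: float,
                        schedule: List[Tuple[float, float]]) -> float:
        # schedule: sorted (batch_count, value) breakpoints; linear in
        # between, clamped outside. E.g. [(0.0, 0.3), (20000.0, 0.1)]
        # yields 0.1 for any batch_count past 20000, matching the
        # 'ans=0.1' readings this late in training.
        xs = [x for x, _ in schedule]
        ys = [y for _, y in schedule]
        if batch_count <= xs[0]:
            return ys[0]
        if batch_count >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, batch_count)
        t = (batch_count - xs[i - 1]) / (xs[i] - xs[i - 1])
        return ys[i - 1] + t * (ys[i] - ys[i - 1])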
], batch size: 195, lr: 5.29e-03, grad_scale: 32.0 2023-06-22 09:22:36,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=935298.0, ans=0.0 2023-06-22 09:22:52,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=935298.0, ans=0.125 2023-06-22 09:23:02,963 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.660e+02 3.094e+02 4.133e+02 6.206e+02, threshold=6.188e+02, percent-clipped=2.0 2023-06-22 09:23:11,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=935418.0, ans=0.0 2023-06-22 09:24:19,147 INFO [train.py:996] (0/4) Epoch 6, batch 3450, loss[loss=0.1961, simple_loss=0.2535, pruned_loss=0.06937, over 21469.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3035, pruned_loss=0.07876, over 4271987.55 frames. ], batch size: 212, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:25:17,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935658.0, ans=0.1 2023-06-22 09:26:06,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=935778.0, ans=0.2 2023-06-22 09:26:29,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935838.0, ans=0.1 2023-06-22 09:26:30,610 INFO [train.py:996] (0/4) Epoch 6, batch 3500, loss[loss=0.2554, simple_loss=0.3313, pruned_loss=0.08976, over 21373.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3136, pruned_loss=0.08294, over 4277026.46 frames. ], batch size: 549, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:26:35,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935838.0, ans=0.1 2023-06-22 09:27:07,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=935898.0, ans=0.1 2023-06-22 09:27:33,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.716e+02 3.159e+02 3.539e+02 5.891e+02, threshold=6.318e+02, percent-clipped=0.0 2023-06-22 09:27:33,876 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-156000.pt 2023-06-22 09:27:53,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=936018.0, ans=0.125 2023-06-22 09:27:56,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=936018.0, ans=0.0 2023-06-22 09:27:59,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=936018.0, ans=0.125 2023-06-22 09:28:44,936 INFO [train.py:996] (0/4) Epoch 6, batch 3550, loss[loss=0.2135, simple_loss=0.2825, pruned_loss=0.07222, over 21687.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3162, pruned_loss=0.08436, over 4281281.81 frames. 
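The "Clipping_scale=2.0, grad-norm quartiles ... threshold=..." entries from optim.py report five-point summaries (min, 25%, median, 75%, max) of recent gradient norms, and the numbers indicate the clipping threshold is the scale times the median: in the 09:23:02 entry above, 2.0 x 3.094e+02 = 6.188e+02, exactly the printed threshold. Below is a sketch of that mechanism under this inferred rule; it is not the actual icefall optim.py implementation, and the window size is an assumption.

    import torch
    from typing import List

    class GradNormClipper:
        def __init__(self, clipping_scale: float = 2.0, window: int = 200):
            self.clipping_scale = clipping_scale
            self.window = window      # how many recent norms to remember
            self.norms = []           # recent total gradient norms
            self.seen = 0
            self.clipped = 0

        def clip_(self, params: List[torch.nn.Parameter]) -> None:
            grads = [p.grad.flatten() for p in params if p.grad is not None]
            norm = torch.cat(grads).norm().item()
            self.norms = (self.norms + [norm])[-self.window:]
            q = torch.quantile(torch.tensor(self.norms),
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * q[2].item()  # scale * median
            self.seen += 1
            if norm > threshold:
                self.clipped += 1
                for p in params:
                    if p.grad is not None:
                        p.grad.mul_(threshold / norm)
            # Same fields as the log (the real code presumably resets its
            # counters each logging interval):
            print("grad-norm quartiles %s, threshold=%.3e, percent-clipped=%.1f"
                  % (" ".join("%.3e" % v for v in q.tolist()),
                     threshold, 100.0 * self.clipped / self.seen))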
], batch size: 282, lr: 5.29e-03, grad_scale: 16.0 2023-06-22 09:28:55,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=936138.0, ans=0.125 2023-06-22 09:29:17,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=936198.0, ans=0.0 2023-06-22 09:30:12,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=936318.0, ans=0.1 2023-06-22 09:30:45,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=936378.0, ans=0.1 2023-06-22 09:30:51,132 INFO [train.py:996] (0/4) Epoch 6, batch 3600, loss[loss=0.2477, simple_loss=0.3192, pruned_loss=0.08813, over 21854.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3111, pruned_loss=0.08322, over 4274901.41 frames. ], batch size: 118, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:32:15,641 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.586e+02 3.004e+02 3.454e+02 6.703e+02, threshold=6.007e+02, percent-clipped=1.0 2023-06-22 09:33:00,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=936678.0, ans=0.0 2023-06-22 09:33:23,797 INFO [train.py:996] (0/4) Epoch 6, batch 3650, loss[loss=0.2865, simple_loss=0.3687, pruned_loss=0.1021, over 21544.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3125, pruned_loss=0.08358, over 4273701.08 frames. ], batch size: 508, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:33:37,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=936738.0, ans=0.0 2023-06-22 09:35:30,885 INFO [train.py:996] (0/4) Epoch 6, batch 3700, loss[loss=0.2279, simple_loss=0.3079, pruned_loss=0.07398, over 21792.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3101, pruned_loss=0.08219, over 4285382.59 frames. ], batch size: 247, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:35:38,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-22 09:36:35,265 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.535e+02 2.948e+02 3.482e+02 5.680e+02, threshold=5.896e+02, percent-clipped=0.0 2023-06-22 09:37:22,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=937278.0, ans=0.1 2023-06-22 09:37:48,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=937278.0, ans=0.125 2023-06-22 09:37:50,975 INFO [train.py:996] (0/4) Epoch 6, batch 3750, loss[loss=0.2138, simple_loss=0.2832, pruned_loss=0.0722, over 21845.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3091, pruned_loss=0.08162, over 4286100.95 frames. ], batch size: 107, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:38:55,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=937458.0, ans=0.0 2023-06-22 09:40:15,567 INFO [train.py:996] (0/4) Epoch 6, batch 3800, loss[loss=0.2107, simple_loss=0.2884, pruned_loss=0.06648, over 21118.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3067, pruned_loss=0.07964, over 4284533.78 frames. 
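One relation worth noting in these summaries: the reported loss is a fixed combination of the other two fields, loss = 0.5 * simple_loss + pruned_loss. For the batch 3600 totals just above, 0.5 * 0.3111 + 0.08322 = 0.23877, matching the printed 0.2388. This is consistent with a pruned-RNN-T objective whose simple (trivial-joiner) term is down-weighted by 0.5; the helper below just restates the arithmetic.

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        return simple_loss_scale * simple_loss + pruned_loss

    # Batch 3600 tot_loss above: 0.5 * 0.3111 + 0.08322 ~ printed 0.2388.
    assert abs(combined_loss(0.3111, 0.08322) - 0.2388) < 5e-4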
], batch size: 608, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:40:35,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=937638.0, ans=0.125 2023-06-22 09:40:52,447 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-22 09:41:09,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=937758.0, ans=0.125 2023-06-22 09:41:16,575 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 2.491e+02 2.732e+02 3.396e+02 7.408e+02, threshold=5.464e+02, percent-clipped=3.0 2023-06-22 09:41:28,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=937818.0, ans=0.125 2023-06-22 09:42:24,409 INFO [train.py:996] (0/4) Epoch 6, batch 3850, loss[loss=0.1949, simple_loss=0.2584, pruned_loss=0.06564, over 21599.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3042, pruned_loss=0.07943, over 4289343.32 frames. ], batch size: 298, lr: 5.28e-03, grad_scale: 16.0 2023-06-22 09:42:29,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=937938.0, ans=0.125 2023-06-22 09:42:30,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=937938.0, ans=0.2 2023-06-22 09:42:31,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2023-06-22 09:42:42,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=937938.0, ans=0.0 2023-06-22 09:42:47,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=937998.0, ans=0.0 2023-06-22 09:44:12,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=938118.0, ans=0.125 2023-06-22 09:44:34,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=938178.0, ans=0.125 2023-06-22 09:44:44,213 INFO [train.py:996] (0/4) Epoch 6, batch 3900, loss[loss=0.2154, simple_loss=0.2811, pruned_loss=0.07487, over 21847.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2994, pruned_loss=0.07899, over 4288524.64 frames. ], batch size: 371, lr: 5.28e-03, grad_scale: 16.0 2023-06-22 09:44:54,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-22 09:45:36,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=938358.0, ans=0.0 2023-06-22 09:45:53,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.660e+02 3.085e+02 3.718e+02 7.121e+02, threshold=6.170e+02, percent-clipped=3.0 2023-06-22 09:46:39,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=938478.0, ans=0.5 2023-06-22 09:47:02,406 INFO [train.py:996] (0/4) Epoch 6, batch 3950, loss[loss=0.1866, simple_loss=0.2634, pruned_loss=0.0549, over 21138.00 frames. 
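The recurring "Whitening: name=..., metric=M vs. limit=L" entries compare a whiteness statistic of a module's output covariance against a (scheduled) limit: a metric near 1.0 means the covariance is close to isotropic, larger values mean a few directions dominate, and the Whiten module only pushes back when the metric exceeds the limit, which is why only violations are logged. One reasonable such statistic, the mean squared eigenvalue over the squared mean eigenvalue of the covariance, is sketched below; the actual scaling.py computation may differ in detail.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        x = x.reshape(-1, x.shape[-1])        # (num_frames, num_channels)
        num_frames, num_channels = x.shape
        d = num_channels // num_groups
        x = x.reshape(num_frames, num_groups, d).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)   # center per group
        cov = x.transpose(1, 2) @ x / num_frames        # (groups, d, d)
        eigs = torch.linalg.eigvalsh(cov)               # spectrum per group
        # 1.0 iff all eigenvalues are equal (perfectly 'white').
        metric = (eigs ** 2).mean(dim=1) / eigs.mean(dim=1).clamp(min=1e-20) ** 2
        return metric.mean().item()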
], tot_loss[loss=0.2292, simple_loss=0.3013, pruned_loss=0.07856, over 4291230.09 frames. ], batch size: 143, lr: 5.28e-03, grad_scale: 16.0 2023-06-22 09:47:02,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=938538.0, ans=0.125 2023-06-22 09:47:27,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=938598.0, ans=0.125 2023-06-22 09:47:47,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=938658.0, ans=0.0 2023-06-22 09:48:14,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=938718.0, ans=0.125 2023-06-22 09:49:00,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=938778.0, ans=0.0 2023-06-22 09:49:05,508 INFO [train.py:996] (0/4) Epoch 6, batch 4000, loss[loss=0.1918, simple_loss=0.2591, pruned_loss=0.06222, over 21778.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.296, pruned_loss=0.07545, over 4283798.53 frames. ], batch size: 351, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:49:17,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=938838.0, ans=0.0 2023-06-22 09:49:29,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=938838.0, ans=15.0 2023-06-22 09:49:36,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=938898.0, ans=0.2 2023-06-22 09:49:55,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=938898.0, ans=0.0 2023-06-22 09:49:57,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=938958.0, ans=0.125 2023-06-22 09:50:04,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-22 09:50:15,096 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.398e+02 2.701e+02 3.226e+02 5.808e+02, threshold=5.402e+02, percent-clipped=0.0 2023-06-22 09:51:02,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=939078.0, ans=0.125 2023-06-22 09:51:04,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=939078.0, ans=0.125 2023-06-22 09:51:27,239 INFO [train.py:996] (0/4) Epoch 6, batch 4050, loss[loss=0.2523, simple_loss=0.3577, pruned_loss=0.07345, over 21271.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2945, pruned_loss=0.0744, over 4286255.25 frames. ], batch size: 548, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:53:58,554 INFO [train.py:996] (0/4) Epoch 6, batch 4100, loss[loss=0.2341, simple_loss=0.3165, pruned_loss=0.07582, over 21707.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2958, pruned_loss=0.07445, over 4290048.88 frames. 
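The grad_scale field in the batch summaries (halving from 32.0 to 16.0 at batch 3000 above, then doubling back to 32.0 at batch 4000) is the dynamic fp16 loss-scaling factor: it is halved whenever scaled gradients overflow and grown again after a run of clean steps. The sketch below uses PyTorch's stock GradScaler to show the mechanism; whether this recipe wraps GradScaler or reimplements the logic is not visible from the log.

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_factor=2.0,
                                       backoff_factor=0.5, growth_interval=2000)

    def training_step(model, optimizer, batch, criterion):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = criterion(model(batch))
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # skipped internally if inf/nan grads found
        scaler.update()         # halves scale on overflow, grows it otherwise
        return scaler.get_scale()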
], batch size: 389, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:54:18,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=939498.0, ans=0.2 2023-06-22 09:54:42,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=939558.0, ans=0.125 2023-06-22 09:54:42,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-22 09:55:00,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=939558.0, ans=0.125 2023-06-22 09:55:09,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.299e+02 2.709e+02 3.070e+02 5.765e+02, threshold=5.418e+02, percent-clipped=2.0 2023-06-22 09:55:24,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=939618.0, ans=0.125 2023-06-22 09:56:10,773 INFO [train.py:996] (0/4) Epoch 6, batch 4150, loss[loss=0.18, simple_loss=0.2758, pruned_loss=0.04209, over 21148.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2963, pruned_loss=0.07222, over 4284997.32 frames. ], batch size: 159, lr: 5.28e-03, grad_scale: 32.0 2023-06-22 09:56:12,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=939738.0, ans=0.1 2023-06-22 09:56:59,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=939858.0, ans=0.0 2023-06-22 09:58:24,480 INFO [train.py:996] (0/4) Epoch 6, batch 4200, loss[loss=0.2611, simple_loss=0.3217, pruned_loss=0.1002, over 21450.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2964, pruned_loss=0.07092, over 4279129.61 frames. ], batch size: 473, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 09:59:11,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=940098.0, ans=0.125 2023-06-22 09:59:23,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-22 09:59:27,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=940158.0, ans=0.2 2023-06-22 09:59:32,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.356e+02 2.693e+02 3.335e+02 6.713e+02, threshold=5.385e+02, percent-clipped=2.0 2023-06-22 10:00:50,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=940278.0, ans=0.2 2023-06-22 10:00:52,416 INFO [train.py:996] (0/4) Epoch 6, batch 4250, loss[loss=0.249, simple_loss=0.3459, pruned_loss=0.07604, over 21854.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3024, pruned_loss=0.07334, over 4273611.64 frames. ], batch size: 317, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:00:54,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=940338.0, ans=0.125 2023-06-22 10:01:10,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.86 vs. 
limit=15.0 2023-06-22 10:01:21,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=940398.0, ans=0.125 2023-06-22 10:01:54,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=940458.0, ans=0.125 2023-06-22 10:02:26,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=940518.0, ans=0.125 2023-06-22 10:02:42,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=940518.0, ans=0.125 2023-06-22 10:03:09,329 INFO [train.py:996] (0/4) Epoch 6, batch 4300, loss[loss=0.2242, simple_loss=0.3131, pruned_loss=0.06768, over 21830.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3105, pruned_loss=0.07619, over 4270342.99 frames. ], batch size: 282, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:03:30,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=940638.0, ans=0.125 2023-06-22 10:03:59,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=940698.0, ans=0.125 2023-06-22 10:04:41,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.64 vs. limit=22.5 2023-06-22 10:04:43,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.946e+02 3.391e+02 4.074e+02 6.738e+02, threshold=6.781e+02, percent-clipped=7.0 2023-06-22 10:05:35,229 INFO [train.py:996] (0/4) Epoch 6, batch 4350, loss[loss=0.2416, simple_loss=0.355, pruned_loss=0.06409, over 21271.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3108, pruned_loss=0.07571, over 4266525.54 frames. ], batch size: 548, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:06:17,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=940998.0, ans=0.125 2023-06-22 10:07:35,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=941178.0, ans=0.125 2023-06-22 10:07:52,257 INFO [train.py:996] (0/4) Epoch 6, batch 4400, loss[loss=0.2471, simple_loss=0.371, pruned_loss=0.06156, over 19859.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3065, pruned_loss=0.07503, over 4256519.61 frames. ], batch size: 702, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:08:45,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=941298.0, ans=0.125 2023-06-22 10:09:18,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.562e+02 2.802e+02 3.458e+02 5.737e+02, threshold=5.605e+02, percent-clipped=0.0 2023-06-22 10:10:16,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=941538.0, ans=0.0 2023-06-22 10:10:17,128 INFO [train.py:996] (0/4) Epoch 6, batch 4450, loss[loss=0.2301, simple_loss=0.3026, pruned_loss=0.07878, over 21449.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3141, pruned_loss=0.07669, over 4267223.56 frames. 
], batch size: 131, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:11:35,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=941658.0, ans=0.1 2023-06-22 10:11:56,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=941718.0, ans=0.015 2023-06-22 10:12:37,934 INFO [train.py:996] (0/4) Epoch 6, batch 4500, loss[loss=0.2587, simple_loss=0.346, pruned_loss=0.08566, over 21893.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3158, pruned_loss=0.079, over 4276425.18 frames. ], batch size: 371, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:12:45,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=941838.0, ans=0.125 2023-06-22 10:13:55,559 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.436e+02 2.759e+02 3.510e+02 5.897e+02, threshold=5.518e+02, percent-clipped=2.0 2023-06-22 10:14:03,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=942018.0, ans=0.035 2023-06-22 10:14:32,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-06-22 10:14:33,459 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-22 10:15:14,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=942138.0, ans=0.0 2023-06-22 10:15:15,313 INFO [train.py:996] (0/4) Epoch 6, batch 4550, loss[loss=0.2534, simple_loss=0.327, pruned_loss=0.08994, over 21314.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3185, pruned_loss=0.07986, over 4275116.53 frames. ], batch size: 548, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:15:23,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942138.0, ans=0.1 2023-06-22 10:15:50,063 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-22 10:15:54,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.85 vs. limit=15.0 2023-06-22 10:16:12,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=942258.0, ans=0.125 2023-06-22 10:17:15,164 INFO [train.py:996] (0/4) Epoch 6, batch 4600, loss[loss=0.1971, simple_loss=0.2767, pruned_loss=0.05875, over 21656.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3199, pruned_loss=0.08095, over 4274057.18 frames. 
], batch size: 230, lr: 5.27e-03, grad_scale: 32.0 2023-06-22 10:17:30,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=942438.0, ans=0.125 2023-06-22 10:18:38,670 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.628e+02 3.061e+02 3.498e+02 7.398e+02, threshold=6.122e+02, percent-clipped=1.0 2023-06-22 10:18:39,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=942558.0, ans=0.125 2023-06-22 10:18:51,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=942618.0, ans=0.125 2023-06-22 10:18:57,727 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-22 10:19:01,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-22 10:19:27,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942678.0, ans=0.1 2023-06-22 10:19:41,655 INFO [train.py:996] (0/4) Epoch 6, batch 4650, loss[loss=0.1769, simple_loss=0.2441, pruned_loss=0.05478, over 21251.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3138, pruned_loss=0.0797, over 4269924.74 frames. ], batch size: 159, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:19:43,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=942738.0, ans=0.125 2023-06-22 10:19:58,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=942738.0, ans=0.0 2023-06-22 10:20:01,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=942798.0, ans=0.125 2023-06-22 10:20:05,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=942798.0, ans=0.125 2023-06-22 10:21:03,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-22 10:21:47,935 INFO [train.py:996] (0/4) Epoch 6, batch 4700, loss[loss=0.1969, simple_loss=0.2628, pruned_loss=0.06545, over 21664.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3039, pruned_loss=0.07712, over 4270517.69 frames. ], batch size: 282, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:22:57,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.395e+02 2.699e+02 3.102e+02 5.296e+02, threshold=5.398e+02, percent-clipped=0.0 2023-06-22 10:23:58,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=943338.0, ans=0.125 2023-06-22 10:23:59,398 INFO [train.py:996] (0/4) Epoch 6, batch 4750, loss[loss=0.2422, simple_loss=0.293, pruned_loss=0.09567, over 20252.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2979, pruned_loss=0.07712, over 4270483.19 frames. 
], batch size: 707, lr: 5.27e-03, grad_scale: 16.0 2023-06-22 10:25:02,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=943458.0, ans=0.125 2023-06-22 10:25:42,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=943578.0, ans=0.125 2023-06-22 10:26:16,871 INFO [train.py:996] (0/4) Epoch 6, batch 4800, loss[loss=0.2222, simple_loss=0.331, pruned_loss=0.05671, over 19793.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2979, pruned_loss=0.07736, over 4275427.81 frames. ], batch size: 703, lr: 5.26e-03, grad_scale: 32.0 2023-06-22 10:27:03,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=943698.0, ans=0.125 2023-06-22 10:27:13,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-22 10:27:31,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.729e+02 2.951e+02 3.440e+02 4.423e+02, threshold=5.901e+02, percent-clipped=0.0 2023-06-22 10:28:23,755 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:28:26,340 INFO [train.py:996] (0/4) Epoch 6, batch 4850, loss[loss=0.1919, simple_loss=0.2736, pruned_loss=0.0551, over 21657.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2976, pruned_loss=0.07634, over 4277037.00 frames. ], batch size: 298, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:28:29,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=943938.0, ans=0.125 2023-06-22 10:29:00,647 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.43 vs. limit=6.0 2023-06-22 10:29:37,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=944058.0, ans=0.125 2023-06-22 10:29:49,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=944118.0, ans=0.125 2023-06-22 10:30:48,617 INFO [train.py:996] (0/4) Epoch 6, batch 4900, loss[loss=0.1864, simple_loss=0.2697, pruned_loss=0.05157, over 20108.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2983, pruned_loss=0.07703, over 4274240.35 frames. 
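Batch sizes in this span swing from about 100 to about 700 (707 and 703 just above) because batches are assembled against a total-duration budget from buckets of similar-length cuts, not a fixed sentence count: long utterances yield small batches, short ones yield large batches. A toy version of that packing rule follows, assuming cut objects with a .duration attribute in seconds; the actual sampler here is lhotse's DynamicBucketingSampler, which adds bucketing, shuffling, and more.

    def duration_batches(cuts, max_duration):
        # cuts: iterable of objects with a .duration attribute (seconds),
        # assumed already drawn from a bucket of similar-length utterances.
        batch, total = [], 0.0
        for cut in cuts:
            if batch and total + cut.duration > max_duration:
                yield batch
                batch, total = [], 0.0
            batch.append(cut)
            total += cut.duration
        if batch:
            yield batch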
], batch size: 703, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:30:53,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=944238.0, ans=0.125 2023-06-22 10:31:03,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=944238.0, ans=0.5 2023-06-22 10:31:11,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=944298.0, ans=0.0 2023-06-22 10:32:13,619 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.505e+02 2.707e+02 3.125e+02 4.814e+02, threshold=5.414e+02, percent-clipped=0.0 2023-06-22 10:32:16,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=944418.0, ans=0.2 2023-06-22 10:33:20,835 INFO [train.py:996] (0/4) Epoch 6, batch 4950, loss[loss=0.1791, simple_loss=0.2632, pruned_loss=0.0475, over 21314.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3025, pruned_loss=0.07492, over 4275867.07 frames. ], batch size: 211, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:33:54,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5 2023-06-22 10:35:00,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=944718.0, ans=0.125 2023-06-22 10:35:19,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=944778.0, ans=0.125 2023-06-22 10:35:31,665 INFO [train.py:996] (0/4) Epoch 6, batch 5000, loss[loss=0.2097, simple_loss=0.2854, pruned_loss=0.06701, over 21455.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3022, pruned_loss=0.07218, over 4279043.52 frames. ], batch size: 194, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:36:25,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=944958.0, ans=0.125 2023-06-22 10:36:31,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-22 10:36:32,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=944958.0, ans=0.1 2023-06-22 10:36:52,263 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.481e+02 2.862e+02 3.375e+02 4.928e+02, threshold=5.725e+02, percent-clipped=0.0 2023-06-22 10:37:47,855 INFO [train.py:996] (0/4) Epoch 6, batch 5050, loss[loss=0.2225, simple_loss=0.2933, pruned_loss=0.07582, over 21933.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3024, pruned_loss=0.07465, over 4291084.94 frames. ], batch size: 333, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:37:48,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=945138.0, ans=0.05 2023-06-22 10:39:10,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=945318.0, ans=0.125 2023-06-22 10:40:04,976 INFO [train.py:996] (0/4) Epoch 6, batch 5100, loss[loss=0.2104, simple_loss=0.2845, pruned_loss=0.0682, over 21861.00 frames. 
], tot_loss[loss=0.2253, simple_loss=0.3014, pruned_loss=0.07461, over 4285776.85 frames. ], batch size: 124, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:40:26,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=945498.0, ans=0.125 2023-06-22 10:41:08,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=945558.0, ans=0.125 2023-06-22 10:41:18,237 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.681e+02 3.101e+02 3.777e+02 6.060e+02, threshold=6.201e+02, percent-clipped=2.0 2023-06-22 10:41:59,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-22 10:42:14,417 INFO [train.py:996] (0/4) Epoch 6, batch 5150, loss[loss=0.2455, simple_loss=0.3235, pruned_loss=0.08373, over 21400.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3023, pruned_loss=0.07579, over 4289213.27 frames. ], batch size: 548, lr: 5.26e-03, grad_scale: 16.0 2023-06-22 10:43:05,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=22.5 2023-06-22 10:43:29,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=945858.0, ans=0.125 2023-06-22 10:43:44,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=945918.0, ans=0.015 2023-06-22 10:44:33,441 INFO [train.py:996] (0/4) Epoch 6, batch 5200, loss[loss=0.2, simple_loss=0.2614, pruned_loss=0.06926, over 21335.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3031, pruned_loss=0.07659, over 4286283.13 frames. ], batch size: 176, lr: 5.26e-03, grad_scale: 32.0 2023-06-22 10:45:07,867 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:45:23,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=946098.0, ans=12.0 2023-06-22 10:45:32,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=946098.0, ans=0.1 2023-06-22 10:45:37,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=946158.0, ans=0.2 2023-06-22 10:45:56,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.577e+02 3.076e+02 3.772e+02 6.113e+02, threshold=6.153e+02, percent-clipped=0.0 2023-06-22 10:46:41,317 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-22 10:46:50,505 INFO [train.py:996] (0/4) Epoch 6, batch 5250, loss[loss=0.1713, simple_loss=0.239, pruned_loss=0.0518, over 21821.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3068, pruned_loss=0.0756, over 4284267.07 frames. ], batch size: 102, lr: 5.26e-03, grad_scale: 32.0 2023-06-22 10:49:19,178 INFO [train.py:996] (0/4) Epoch 6, batch 5300, loss[loss=0.2353, simple_loss=0.3014, pruned_loss=0.08463, over 21893.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3061, pruned_loss=0.07684, over 4289844.31 frames. 
], batch size: 351, lr: 5.26e-03, grad_scale: 32.0 2023-06-22 10:49:35,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=946638.0, ans=0.125 2023-06-22 10:49:36,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=946638.0, ans=0.1 2023-06-22 10:49:38,710 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.11 vs. limit=6.0 2023-06-22 10:50:13,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=946698.0, ans=0.0 2023-06-22 10:50:15,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=946758.0, ans=0.1 2023-06-22 10:50:18,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=946758.0, ans=0.125 2023-06-22 10:50:30,807 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.544e+02 2.916e+02 3.415e+02 4.967e+02, threshold=5.832e+02, percent-clipped=0.0 2023-06-22 10:51:03,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=946878.0, ans=0.05 2023-06-22 10:51:14,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=946878.0, ans=0.125 2023-06-22 10:51:17,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=946878.0, ans=0.125 2023-06-22 10:51:25,705 INFO [train.py:996] (0/4) Epoch 6, batch 5350, loss[loss=0.2189, simple_loss=0.2876, pruned_loss=0.07511, over 21903.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3048, pruned_loss=0.07809, over 4294804.39 frames. ], batch size: 316, lr: 5.26e-03, grad_scale: 32.0 2023-06-22 10:52:21,197 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-22 10:52:49,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=947118.0, ans=0.125 2023-06-22 10:52:51,876 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-22 10:53:05,542 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:54:00,185 INFO [train.py:996] (0/4) Epoch 6, batch 5400, loss[loss=0.2034, simple_loss=0.266, pruned_loss=0.07038, over 21651.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3048, pruned_loss=0.07818, over 4283663.36 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 16.0 2023-06-22 10:54:08,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.14 vs. 
limit=6.0 2023-06-22 10:54:28,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=947298.0, ans=0.0 2023-06-22 10:54:30,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=947298.0, ans=0.09899494936611666 2023-06-22 10:55:06,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=947358.0, ans=0.0 2023-06-22 10:55:11,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-22 10:55:21,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.657e+02 2.997e+02 3.766e+02 7.720e+02, threshold=5.994e+02, percent-clipped=1.0 2023-06-22 10:56:10,773 INFO [train.py:996] (0/4) Epoch 6, batch 5450, loss[loss=0.2137, simple_loss=0.2792, pruned_loss=0.07407, over 21181.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3043, pruned_loss=0.07625, over 4281732.75 frames. ], batch size: 608, lr: 5.25e-03, grad_scale: 16.0 2023-06-22 10:56:31,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=947538.0, ans=0.125 2023-06-22 10:56:31,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-22 10:57:31,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=947658.0, ans=0.125 2023-06-22 10:57:31,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.14 vs. limit=15.0 2023-06-22 10:58:02,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=947718.0, ans=0.1 2023-06-22 10:58:26,439 INFO [train.py:996] (0/4) Epoch 6, batch 5500, loss[loss=0.1983, simple_loss=0.2984, pruned_loss=0.04912, over 21658.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3091, pruned_loss=0.07341, over 4279133.58 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 16.0 2023-06-22 10:59:03,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=22.5 2023-06-22 10:59:54,274 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.313e+02 2.665e+02 3.124e+02 5.281e+02, threshold=5.330e+02, percent-clipped=0.0 2023-06-22 11:00:36,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=948078.0, ans=0.1 2023-06-22 11:00:46,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=948138.0, ans=0.125 2023-06-22 11:00:47,681 INFO [train.py:996] (0/4) Epoch 6, batch 5550, loss[loss=0.2179, simple_loss=0.2917, pruned_loss=0.07203, over 21016.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3076, pruned_loss=0.06962, over 4272008.10 frames. 
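The lr field decays only slightly across this whole span (5.30e-03 at the top down to 5.25e-03 here) because the scheduler decays smoothly in both batch count and fractional epoch rather than in discrete steps. Below is a sketch of an Eden-style schedule of the kind icefall uses; the exact exponents and the two time constants (lr_batches, lr_epochs) are stated from memory, so treat the precise form as an assumption rather than the recipe's definitive formula.

    def eden_lr(base_lr, step, epoch, lr_batches, lr_epochs):
        # Smooth inverse-power decay in both batch count and epoch;
        # no step changes, hence the slow drift of the printed lr.
        batch_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor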
], batch size: 607, lr: 5.25e-03, grad_scale: 16.0 2023-06-22 11:01:18,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=948138.0, ans=0.125 2023-06-22 11:01:44,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=948198.0, ans=0.0 2023-06-22 11:02:36,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=948378.0, ans=0.0 2023-06-22 11:02:37,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=948378.0, ans=0.035 2023-06-22 11:02:51,121 INFO [train.py:996] (0/4) Epoch 6, batch 5600, loss[loss=0.2011, simple_loss=0.2808, pruned_loss=0.0607, over 21149.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3036, pruned_loss=0.06655, over 4278984.24 frames. ], batch size: 143, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:03:38,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=948498.0, ans=0.125 2023-06-22 11:04:21,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 2.304e+02 2.857e+02 3.382e+02 5.869e+02, threshold=5.713e+02, percent-clipped=3.0 2023-06-22 11:05:07,820 INFO [train.py:996] (0/4) Epoch 6, batch 5650, loss[loss=0.2344, simple_loss=0.3067, pruned_loss=0.08104, over 21888.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3073, pruned_loss=0.06842, over 4274985.64 frames. ], batch size: 351, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:05:31,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-22 11:05:34,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=948738.0, ans=0.0 2023-06-22 11:06:24,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-06-22 11:06:45,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=948918.0, ans=0.125 2023-06-22 11:07:46,661 INFO [train.py:996] (0/4) Epoch 6, batch 5700, loss[loss=0.2451, simple_loss=0.3067, pruned_loss=0.09177, over 21607.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3086, pruned_loss=0.07098, over 4275456.23 frames. ], batch size: 548, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:08:04,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-22 11:08:21,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=949098.0, ans=0.0 2023-06-22 11:08:27,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-22 11:09:03,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.632e+02 3.011e+02 3.527e+02 5.738e+02, threshold=6.022e+02, percent-clipped=1.0 2023-06-22 11:10:03,782 INFO [train.py:996] (0/4) Epoch 6, batch 5750, loss[loss=0.1764, simple_loss=0.2698, pruned_loss=0.04155, over 21778.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3056, pruned_loss=0.07004, over 4282032.71 frames. 
], batch size: 332, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:10:58,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=22.5 2023-06-22 11:12:21,477 INFO [train.py:996] (0/4) Epoch 6, batch 5800, loss[loss=0.3296, simple_loss=0.4047, pruned_loss=0.1273, over 21501.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3051, pruned_loss=0.06849, over 4283499.97 frames. ], batch size: 508, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:12:27,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=949638.0, ans=0.125 2023-06-22 11:12:51,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=949698.0, ans=0.2 2023-06-22 11:13:53,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=949818.0, ans=0.125 2023-06-22 11:14:01,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.381e+02 2.868e+02 4.054e+02 6.693e+02, threshold=5.736e+02, percent-clipped=1.0 2023-06-22 11:14:58,420 INFO [train.py:996] (0/4) Epoch 6, batch 5850, loss[loss=0.1601, simple_loss=0.2369, pruned_loss=0.04164, over 21900.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.3031, pruned_loss=0.06483, over 4287118.37 frames. ], batch size: 107, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:15:41,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=949998.0, ans=0.1 2023-06-22 11:16:06,947 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:16:47,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=950178.0, ans=0.0 2023-06-22 11:17:03,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=950178.0, ans=0.2 2023-06-22 11:17:07,183 INFO [train.py:996] (0/4) Epoch 6, batch 5900, loss[loss=0.1909, simple_loss=0.2678, pruned_loss=0.05697, over 21779.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2964, pruned_loss=0.06028, over 4280987.76 frames. ], batch size: 298, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:18:39,854 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.954e+02 2.379e+02 3.002e+02 5.426e+02, threshold=4.759e+02, percent-clipped=0.0 2023-06-22 11:19:22,076 INFO [train.py:996] (0/4) Epoch 6, batch 5950, loss[loss=0.2106, simple_loss=0.2775, pruned_loss=0.07189, over 21746.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2955, pruned_loss=0.06402, over 4278768.47 frames. 
], batch size: 333, lr: 5.25e-03, grad_scale: 32.0 2023-06-22 11:20:03,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=950598.0, ans=0.125 2023-06-22 11:20:38,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950658.0, ans=0.1 2023-06-22 11:20:46,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950718.0, ans=0.1 2023-06-22 11:20:53,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=950718.0, ans=0.2 2023-06-22 11:21:03,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=950718.0, ans=0.125 2023-06-22 11:21:06,760 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-22 11:21:37,698 INFO [train.py:996] (0/4) Epoch 6, batch 6000, loss[loss=0.1863, simple_loss=0.2535, pruned_loss=0.05956, over 21753.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2909, pruned_loss=0.06719, over 4285071.16 frames. ], batch size: 112, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:21:37,701 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 11:22:41,435 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2615, simple_loss=0.3543, pruned_loss=0.08434, over 1796401.00 frames. 2023-06-22 11:22:41,437 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23752MB 2023-06-22 11:23:20,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=950898.0, ans=0.125 2023-06-22 11:23:22,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=950898.0, ans=0.0 2023-06-22 11:23:25,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=950958.0, ans=0.0 2023-06-22 11:23:40,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=950958.0, ans=0.125 2023-06-22 11:23:47,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.609e+02 2.903e+02 3.362e+02 5.705e+02, threshold=5.807e+02, percent-clipped=2.0 2023-06-22 11:23:58,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=951018.0, ans=0.125 2023-06-22 11:24:13,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=951018.0, ans=0.125 2023-06-22 11:24:23,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-22 11:24:26,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=951078.0, ans=0.0 2023-06-22 11:24:26,796 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-22 11:24:30,286 INFO [train.py:996] (0/4) Epoch 6, batch 6050, loss[loss=0.1645, simple_loss=0.243, pruned_loss=0.04298, over 21696.00 frames. 
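The batch 6000 entries above show the periodic validation pass: training pauses, the whole dev set is rescored, and a frame-weighted average loss is reported over 1796401.00 frames (the dev-set total), followed by the peak CUDA allocation. A minimal sketch of that bookkeeping follows; model_forward is a hypothetical helper standing in for the recipe's actual loss computation, returning the frame-averaged loss of a batch and its frame count.

    import torch

    def compute_validation_loss(model, valid_loader, device):
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss, num_frames = model_forward(model, batch, device)
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        # Peak allocation, as in 'Maximum memory allocated so far is ...MB'.
        mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print("validation: loss=%.4f, over %.2f frames. Maximum memory "
              "allocated so far is %dMB"
              % (tot_loss / tot_frames, tot_frames, mem_mb))
        return tot_loss / tot_frames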
], tot_loss[loss=0.2104, simple_loss=0.2855, pruned_loss=0.06771, over 4278807.25 frames. ], batch size: 247, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:26:00,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=951258.0, ans=22.5 2023-06-22 11:26:04,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=951258.0, ans=0.125 2023-06-22 11:26:42,252 INFO [train.py:996] (0/4) Epoch 6, batch 6100, loss[loss=0.1868, simple_loss=0.284, pruned_loss=0.0448, over 21803.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2832, pruned_loss=0.06643, over 4282692.61 frames. ], batch size: 371, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:27:31,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=951498.0, ans=0.0 2023-06-22 11:27:35,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-22 11:28:19,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 2.238e+02 2.454e+02 2.758e+02 3.934e+02, threshold=4.908e+02, percent-clipped=0.0 2023-06-22 11:28:27,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=951618.0, ans=0.125 2023-06-22 11:28:37,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=951678.0, ans=0.125 2023-06-22 11:28:59,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=951738.0, ans=0.0 2023-06-22 11:28:59,924 INFO [train.py:996] (0/4) Epoch 6, batch 6150, loss[loss=0.1994, simple_loss=0.2713, pruned_loss=0.06378, over 21527.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2867, pruned_loss=0.06956, over 4285975.11 frames. ], batch size: 195, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:30:46,104 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:31:17,601 INFO [train.py:996] (0/4) Epoch 6, batch 6200, loss[loss=0.2126, simple_loss=0.2871, pruned_loss=0.06907, over 21381.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2891, pruned_loss=0.06893, over 4277167.59 frames. ], batch size: 159, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:31:27,105 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5 2023-06-22 11:32:48,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-22 11:32:50,652 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.387e+02 2.806e+02 3.164e+02 6.088e+02, threshold=5.612e+02, percent-clipped=2.0 2023-06-22 11:33:34,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=952278.0, ans=0.125 2023-06-22 11:33:45,684 INFO [train.py:996] (0/4) Epoch 6, batch 6250, loss[loss=0.2161, simple_loss=0.3205, pruned_loss=0.05581, over 21784.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2946, pruned_loss=0.06786, over 4273342.17 frames. 
], batch size: 332, lr: 5.24e-03, grad_scale: 16.0 2023-06-22 11:34:51,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=952458.0, ans=0.125 2023-06-22 11:35:14,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-22 11:35:28,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=952578.0, ans=0.125 2023-06-22 11:35:49,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=22.5 2023-06-22 11:36:01,122 INFO [train.py:996] (0/4) Epoch 6, batch 6300, loss[loss=0.2816, simple_loss=0.4027, pruned_loss=0.08022, over 20816.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2989, pruned_loss=0.06808, over 4267915.19 frames. ], batch size: 607, lr: 5.24e-03, grad_scale: 16.0 2023-06-22 11:36:32,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=952698.0, ans=0.125 2023-06-22 11:36:33,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=952698.0, ans=0.0 2023-06-22 11:37:00,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-22 11:37:16,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=952818.0, ans=0.125 2023-06-22 11:37:18,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.436e+02 3.137e+02 3.711e+02 6.138e+02, threshold=6.275e+02, percent-clipped=3.0 2023-06-22 11:38:10,542 INFO [train.py:996] (0/4) Epoch 6, batch 6350, loss[loss=0.2399, simple_loss=0.3144, pruned_loss=0.08266, over 21807.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3031, pruned_loss=0.07337, over 4276822.91 frames. ], batch size: 282, lr: 5.24e-03, grad_scale: 16.0 2023-06-22 11:38:20,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=952938.0, ans=0.0 2023-06-22 11:38:22,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952938.0, ans=0.1 2023-06-22 11:38:26,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=952938.0, ans=0.125 2023-06-22 11:38:33,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=952938.0, ans=0.2 2023-06-22 11:38:33,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=952938.0, ans=0.125 2023-06-22 11:39:34,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=953118.0, ans=0.125 2023-06-22 11:40:25,791 INFO [train.py:996] (0/4) Epoch 6, batch 6400, loss[loss=0.2869, simple_loss=0.3504, pruned_loss=0.1117, over 21821.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3101, pruned_loss=0.07831, over 4276213.58 frames. 
], batch size: 441, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:40:38,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=953238.0, ans=0.125 2023-06-22 11:41:00,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953298.0, ans=0.1 2023-06-22 11:41:25,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=953358.0, ans=0.125 2023-06-22 11:41:41,718 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:42:05,389 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.715e+02 2.941e+02 3.415e+02 4.411e+02, threshold=5.882e+02, percent-clipped=0.0 2023-06-22 11:42:11,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=953418.0, ans=0.125 2023-06-22 11:42:49,199 INFO [train.py:996] (0/4) Epoch 6, batch 6450, loss[loss=0.188, simple_loss=0.2627, pruned_loss=0.05666, over 21815.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3122, pruned_loss=0.07699, over 4277995.63 frames. ], batch size: 124, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:43:34,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=953658.0, ans=0.1 2023-06-22 11:44:00,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=953658.0, ans=0.125 2023-06-22 11:44:39,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=953778.0, ans=0.04949747468305833 2023-06-22 11:45:02,723 INFO [train.py:996] (0/4) Epoch 6, batch 6500, loss[loss=0.1786, simple_loss=0.2549, pruned_loss=0.05119, over 21533.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3063, pruned_loss=0.07522, over 4272708.12 frames. ], batch size: 230, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:45:13,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=953838.0, ans=0.125 2023-06-22 11:45:51,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-22 11:46:19,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=954018.0, ans=0.025 2023-06-22 11:46:23,197 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.489e+02 2.776e+02 3.304e+02 5.891e+02, threshold=5.553e+02, percent-clipped=1.0 2023-06-22 11:46:49,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=954078.0, ans=0.0 2023-06-22 11:47:16,599 INFO [train.py:996] (0/4) Epoch 6, batch 6550, loss[loss=0.1999, simple_loss=0.282, pruned_loss=0.05888, over 21589.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3049, pruned_loss=0.07415, over 4261659.27 frames. ], batch size: 230, lr: 5.24e-03, grad_scale: 32.0 2023-06-22 11:47:24,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. 
limit=15.0 2023-06-22 11:47:31,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=954198.0, ans=0.0 2023-06-22 11:47:36,208 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.38 vs. limit=22.5 2023-06-22 11:47:50,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=954198.0, ans=0.125 2023-06-22 11:48:41,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-22 11:48:46,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=954318.0, ans=0.125 2023-06-22 11:48:56,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=954378.0, ans=0.125 2023-06-22 11:48:58,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.79 vs. limit=6.0 2023-06-22 11:49:05,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=954378.0, ans=0.07 2023-06-22 11:49:17,333 INFO [train.py:996] (0/4) Epoch 6, batch 6600, loss[loss=0.2101, simple_loss=0.2693, pruned_loss=0.07546, over 21799.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2996, pruned_loss=0.07384, over 4273367.52 frames. ], batch size: 98, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 11:49:54,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=954498.0, ans=0.2 2023-06-22 11:50:17,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=954558.0, ans=0.0 2023-06-22 11:50:38,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.307e+02 2.568e+02 2.890e+02 5.547e+02, threshold=5.135e+02, percent-clipped=0.0 2023-06-22 11:50:40,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.46 vs. limit=15.0 2023-06-22 11:51:14,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2023-06-22 11:51:14,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=954678.0, ans=0.2 2023-06-22 11:51:27,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=954678.0, ans=0.125 2023-06-22 11:51:31,387 INFO [train.py:996] (0/4) Epoch 6, batch 6650, loss[loss=0.2142, simple_loss=0.2564, pruned_loss=0.08601, over 20111.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2922, pruned_loss=0.07131, over 4271586.69 frames. ], batch size: 703, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 11:52:08,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=954798.0, ans=0.125 2023-06-22 11:53:49,680 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. 
limit=6.0 2023-06-22 11:53:52,924 INFO [train.py:996] (0/4) Epoch 6, batch 6700, loss[loss=0.1806, simple_loss=0.2592, pruned_loss=0.05098, over 21817.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2859, pruned_loss=0.07111, over 4267554.25 frames. ], batch size: 118, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 11:55:20,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.337e+02 2.565e+02 3.029e+02 4.063e+02, threshold=5.130e+02, percent-clipped=0.0 2023-06-22 11:56:01,264 INFO [train.py:996] (0/4) Epoch 6, batch 6750, loss[loss=0.2039, simple_loss=0.2679, pruned_loss=0.07001, over 21263.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2828, pruned_loss=0.07097, over 4263227.58 frames. ], batch size: 176, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 11:56:58,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=955458.0, ans=0.07 2023-06-22 11:57:05,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=955458.0, ans=0.125 2023-06-22 11:58:03,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=955578.0, ans=0.0 2023-06-22 11:58:11,988 INFO [train.py:996] (0/4) Epoch 6, batch 6800, loss[loss=0.2447, simple_loss=0.3035, pruned_loss=0.09292, over 21883.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2853, pruned_loss=0.07328, over 4275243.52 frames. ], batch size: 98, lr: 5.23e-03, grad_scale: 32.0 2023-06-22 11:58:43,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-06-22 11:59:06,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=955698.0, ans=0.0 2023-06-22 11:59:07,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-22 11:59:49,864 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.516e+02 2.922e+02 3.493e+02 5.598e+02, threshold=5.845e+02, percent-clipped=3.0 2023-06-22 11:59:50,831 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-22 11:59:53,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=955818.0, ans=0.04949747468305833 2023-06-22 12:00:24,114 INFO [train.py:996] (0/4) Epoch 6, batch 6850, loss[loss=0.2539, simple_loss=0.2905, pruned_loss=0.1086, over 21463.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2841, pruned_loss=0.07437, over 4270976.27 frames. ], batch size: 509, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:00:34,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=955938.0, ans=0.125 2023-06-22 12:00:39,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=12.0 2023-06-22 12:02:21,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.26 vs. 
limit=22.5 2023-06-22 12:02:52,554 INFO [train.py:996] (0/4) Epoch 6, batch 6900, loss[loss=0.2473, simple_loss=0.3267, pruned_loss=0.08394, over 21622.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2865, pruned_loss=0.07454, over 4280534.15 frames. ], batch size: 508, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:03:02,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=956238.0, ans=0.0 2023-06-22 12:03:03,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=956238.0, ans=0.125 2023-06-22 12:04:50,164 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 2.414e+02 2.854e+02 3.721e+02 5.667e+02, threshold=5.709e+02, percent-clipped=0.0 2023-06-22 12:04:51,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-22 12:05:05,871 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:05:09,374 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-22 12:05:19,760 INFO [train.py:996] (0/4) Epoch 6, batch 6950, loss[loss=0.2527, simple_loss=0.3289, pruned_loss=0.08821, over 21720.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2885, pruned_loss=0.07157, over 4275034.80 frames. ], batch size: 332, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:05:39,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=956538.0, ans=0.0 2023-06-22 12:05:40,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=956538.0, ans=0.0 2023-06-22 12:06:03,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=956598.0, ans=0.125 2023-06-22 12:06:20,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=956658.0, ans=0.0 2023-06-22 12:07:07,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-22 12:07:18,201 INFO [train.py:996] (0/4) Epoch 6, batch 7000, loss[loss=0.2237, simple_loss=0.2835, pruned_loss=0.08191, over 21447.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2912, pruned_loss=0.07352, over 4279768.30 frames. 
], batch size: 389, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:08:17,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=956898.0, ans=0.125 2023-06-22 12:08:19,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=956898.0, ans=0.125 2023-06-22 12:08:40,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=956958.0, ans=0.125 2023-06-22 12:08:59,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 2.490e+02 2.766e+02 3.246e+02 6.090e+02, threshold=5.532e+02, percent-clipped=1.0 2023-06-22 12:09:03,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=957018.0, ans=0.125 2023-06-22 12:09:06,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=957018.0, ans=0.0 2023-06-22 12:09:11,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0 2023-06-22 12:09:37,434 INFO [train.py:996] (0/4) Epoch 6, batch 7050, loss[loss=0.2062, simple_loss=0.2919, pruned_loss=0.06025, over 21730.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2879, pruned_loss=0.07199, over 4282568.73 frames. ], batch size: 351, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:09:46,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=957138.0, ans=0.05 2023-06-22 12:09:48,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-22 12:10:09,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=957198.0, ans=0.0 2023-06-22 12:11:19,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=957318.0, ans=0.0 2023-06-22 12:11:59,671 INFO [train.py:996] (0/4) Epoch 6, batch 7100, loss[loss=0.1927, simple_loss=0.2718, pruned_loss=0.05678, over 21301.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2935, pruned_loss=0.07353, over 4285950.11 frames. ], batch size: 176, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:12:30,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.70 vs. limit=15.0 2023-06-22 12:13:01,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=957558.0, ans=0.1 2023-06-22 12:13:07,412 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:13:26,828 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 2.296e+02 2.660e+02 3.134e+02 4.737e+02, threshold=5.321e+02, percent-clipped=0.0 2023-06-22 12:13:53,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=957678.0, ans=0.0 2023-06-22 12:14:15,111 INFO [train.py:996] (0/4) Epoch 6, batch 7150, loss[loss=0.1954, simple_loss=0.2757, pruned_loss=0.05757, over 21763.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2902, pruned_loss=0.07062, over 4274269.36 frames. 
], batch size: 332, lr: 5.23e-03, grad_scale: 16.0 2023-06-22 12:15:11,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-22 12:15:33,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=957918.0, ans=0.125 2023-06-22 12:15:59,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=957978.0, ans=0.125 2023-06-22 12:16:21,675 INFO [train.py:996] (0/4) Epoch 6, batch 7200, loss[loss=0.2068, simple_loss=0.3148, pruned_loss=0.04946, over 19707.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2935, pruned_loss=0.07271, over 4269199.81 frames. ], batch size: 703, lr: 5.23e-03, grad_scale: 32.0 2023-06-22 12:16:31,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=958038.0, ans=15.0 2023-06-22 12:16:52,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=958038.0, ans=0.02 2023-06-22 12:17:30,854 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=22.5 2023-06-22 12:17:34,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=958158.0, ans=0.05 2023-06-22 12:17:51,026 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.534e+02 2.859e+02 3.479e+02 6.830e+02, threshold=5.718e+02, percent-clipped=3.0 2023-06-22 12:18:08,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=958278.0, ans=0.125 2023-06-22 12:18:29,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=958338.0, ans=0.125 2023-06-22 12:18:29,939 INFO [train.py:996] (0/4) Epoch 6, batch 7250, loss[loss=0.2219, simple_loss=0.3287, pruned_loss=0.05752, over 19813.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2906, pruned_loss=0.07284, over 4260735.25 frames. ], batch size: 703, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:18:31,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=958338.0, ans=0.1 2023-06-22 12:19:50,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=958518.0, ans=0.07 2023-06-22 12:20:38,077 INFO [train.py:996] (0/4) Epoch 6, batch 7300, loss[loss=0.1823, simple_loss=0.2804, pruned_loss=0.04207, over 20781.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2854, pruned_loss=0.072, over 4267853.53 frames. 
], batch size: 609, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:20:52,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=958638.0, ans=0.1 2023-06-22 12:21:27,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=958698.0, ans=0.0 2023-06-22 12:22:08,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.524e+02 2.909e+02 3.495e+02 5.392e+02, threshold=5.818e+02, percent-clipped=0.0 2023-06-22 12:22:08,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=958818.0, ans=0.05 2023-06-22 12:22:30,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0 2023-06-22 12:22:45,858 INFO [train.py:996] (0/4) Epoch 6, batch 7350, loss[loss=0.2456, simple_loss=0.3108, pruned_loss=0.09018, over 21550.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2845, pruned_loss=0.07353, over 4262474.11 frames. ], batch size: 389, lr: 5.22e-03, grad_scale: 16.0 2023-06-22 12:22:48,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2023-06-22 12:24:30,027 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=12.0 2023-06-22 12:24:40,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=959178.0, ans=0.02 2023-06-22 12:25:10,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=959238.0, ans=0.1 2023-06-22 12:25:17,235 INFO [train.py:996] (0/4) Epoch 6, batch 7400, loss[loss=0.211, simple_loss=0.2944, pruned_loss=0.06381, over 21690.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2909, pruned_loss=0.07609, over 4265775.09 frames. ], batch size: 247, lr: 5.22e-03, grad_scale: 16.0 2023-06-22 12:25:19,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=959238.0, ans=0.125 2023-06-22 12:25:35,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=959238.0, ans=15.0 2023-06-22 12:26:43,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=959418.0, ans=0.125 2023-06-22 12:26:45,562 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-22 12:26:47,313 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.566e+02 3.026e+02 3.551e+02 6.030e+02, threshold=6.052e+02, percent-clipped=1.0 2023-06-22 12:26:58,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.66 vs. 
limit=15.0 2023-06-22 12:27:23,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=959538.0, ans=0.125 2023-06-22 12:27:30,397 INFO [train.py:996] (0/4) Epoch 6, batch 7450, loss[loss=0.2575, simple_loss=0.3039, pruned_loss=0.1055, over 21371.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2889, pruned_loss=0.07464, over 4269184.82 frames. ], batch size: 473, lr: 5.22e-03, grad_scale: 16.0 2023-06-22 12:27:48,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=959538.0, ans=0.125 2023-06-22 12:29:34,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=959778.0, ans=0.5 2023-06-22 12:29:37,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=959838.0, ans=10.0 2023-06-22 12:29:38,270 INFO [train.py:996] (0/4) Epoch 6, batch 7500, loss[loss=0.2984, simple_loss=0.3894, pruned_loss=0.1037, over 21655.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2953, pruned_loss=0.07675, over 4270739.78 frames. ], batch size: 441, lr: 5.22e-03, grad_scale: 16.0 2023-06-22 12:30:04,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-22 12:30:13,463 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-22 12:31:05,732 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-160000.pt 2023-06-22 12:31:12,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=959958.0, ans=0.0 2023-06-22 12:31:28,373 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.801e+02 3.373e+02 4.203e+02 7.469e+02, threshold=6.746e+02, percent-clipped=3.0 2023-06-22 12:31:32,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.81 vs. limit=10.0 2023-06-22 12:32:01,773 INFO [train.py:996] (0/4) Epoch 6, batch 7550, loss[loss=0.2093, simple_loss=0.3029, pruned_loss=0.0579, over 21639.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3029, pruned_loss=0.07493, over 4280527.67 frames. ], batch size: 230, lr: 5.22e-03, grad_scale: 16.0 2023-06-22 12:32:10,539 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:32:51,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-22 12:32:55,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=960258.0, ans=0.0 2023-06-22 12:33:54,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960378.0, ans=0.1 2023-06-22 12:34:08,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=960378.0, ans=0.125 2023-06-22 12:34:12,403 INFO [train.py:996] (0/4) Epoch 6, batch 7600, loss[loss=0.2353, simple_loss=0.3011, pruned_loss=0.08477, over 21359.00 frames. 
], tot_loss[loss=0.2244, simple_loss=0.3015, pruned_loss=0.07365, over 4286280.91 frames. ], batch size: 143, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:35:10,950 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:35:36,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=960558.0, ans=0.125 2023-06-22 12:35:36,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960558.0, ans=0.1 2023-06-22 12:35:50,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-06-22 12:35:59,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.444e+02 2.746e+02 3.359e+02 4.824e+02, threshold=5.491e+02, percent-clipped=0.0 2023-06-22 12:36:35,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=960678.0, ans=0.0 2023-06-22 12:36:39,357 INFO [train.py:996] (0/4) Epoch 6, batch 7650, loss[loss=0.2524, simple_loss=0.321, pruned_loss=0.09192, over 21882.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2995, pruned_loss=0.07466, over 4283081.49 frames. ], batch size: 118, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:36:41,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960738.0, ans=0.1 2023-06-22 12:36:50,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=960738.0, ans=0.125 2023-06-22 12:36:53,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=960798.0, ans=0.0 2023-06-22 12:37:06,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=960798.0, ans=0.0 2023-06-22 12:37:09,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=960798.0, ans=0.125 2023-06-22 12:37:10,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=960798.0, ans=0.125 2023-06-22 12:37:26,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960858.0, ans=0.1 2023-06-22 12:37:27,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=960858.0, ans=0.125 2023-06-22 12:38:40,602 INFO [train.py:996] (0/4) Epoch 6, batch 7700, loss[loss=0.2395, simple_loss=0.3009, pruned_loss=0.08903, over 21817.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3024, pruned_loss=0.07758, over 4289119.91 frames. ], batch size: 441, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:39:22,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. 
limit=15.0 2023-06-22 12:40:18,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.597e+02 2.996e+02 3.499e+02 4.592e+02, threshold=5.993e+02, percent-clipped=0.0 2023-06-22 12:41:02,114 INFO [train.py:996] (0/4) Epoch 6, batch 7750, loss[loss=0.2666, simple_loss=0.3624, pruned_loss=0.0854, over 21748.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3097, pruned_loss=0.07835, over 4288290.91 frames. ], batch size: 351, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:41:04,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=961338.0, ans=0.2 2023-06-22 12:41:35,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=961398.0, ans=0.125 2023-06-22 12:41:39,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=961398.0, ans=0.1 2023-06-22 12:43:03,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=961578.0, ans=0.125 2023-06-22 12:43:18,986 INFO [train.py:996] (0/4) Epoch 6, batch 7800, loss[loss=0.2061, simple_loss=0.2696, pruned_loss=0.07134, over 21570.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3117, pruned_loss=0.07907, over 4289802.89 frames. ], batch size: 230, lr: 5.22e-03, grad_scale: 32.0 2023-06-22 12:44:18,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=961758.0, ans=10.0 2023-06-22 12:44:40,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.813e+02 3.314e+02 4.119e+02 8.453e+02, threshold=6.627e+02, percent-clipped=5.0 2023-06-22 12:44:52,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=961878.0, ans=0.125 2023-06-22 12:45:23,670 INFO [train.py:996] (0/4) Epoch 6, batch 7850, loss[loss=0.2129, simple_loss=0.2606, pruned_loss=0.08265, over 20317.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3039, pruned_loss=0.07814, over 4269140.16 frames. ], batch size: 703, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 12:46:20,453 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-22 12:46:31,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=962118.0, ans=0.125 2023-06-22 12:46:56,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=22.5 2023-06-22 12:46:58,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=962178.0, ans=0.1 2023-06-22 12:47:34,169 INFO [train.py:996] (0/4) Epoch 6, batch 7900, loss[loss=0.1479, simple_loss=0.1801, pruned_loss=0.0578, over 16214.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2992, pruned_loss=0.07814, over 4255170.75 frames. 
], batch size: 61, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 12:48:38,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=962358.0, ans=0.125 2023-06-22 12:48:49,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.11 vs. limit=15.0 2023-06-22 12:49:17,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.834e+02 3.343e+02 3.831e+02 7.219e+02, threshold=6.686e+02, percent-clipped=3.0 2023-06-22 12:49:49,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=962478.0, ans=0.0 2023-06-22 12:49:55,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=962478.0, ans=0.2 2023-06-22 12:50:10,560 INFO [train.py:996] (0/4) Epoch 6, batch 7950, loss[loss=0.1946, simple_loss=0.2644, pruned_loss=0.06234, over 20773.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3036, pruned_loss=0.0772, over 4251787.08 frames. ], batch size: 609, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 12:52:36,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-22 12:52:48,112 INFO [train.py:996] (0/4) Epoch 6, batch 8000, loss[loss=0.2414, simple_loss=0.3585, pruned_loss=0.0621, over 20769.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3092, pruned_loss=0.07866, over 4253849.04 frames. ], batch size: 607, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 12:53:06,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=962838.0, ans=0.0 2023-06-22 12:53:16,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=962898.0, ans=0.0 2023-06-22 12:53:30,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=962958.0, ans=10.0 2023-06-22 12:54:34,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.700e+02 3.556e+02 4.480e+02 7.069e+02, threshold=7.112e+02, percent-clipped=3.0 2023-06-22 12:55:13,308 INFO [train.py:996] (0/4) Epoch 6, batch 8050, loss[loss=0.2935, simple_loss=0.3763, pruned_loss=0.1054, over 21613.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3144, pruned_loss=0.08, over 4251239.62 frames. ], batch size: 441, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 12:56:02,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=963198.0, ans=0.125 2023-06-22 12:57:34,238 INFO [train.py:996] (0/4) Epoch 6, batch 8100, loss[loss=0.2277, simple_loss=0.2961, pruned_loss=0.07969, over 21015.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3116, pruned_loss=0.08012, over 4260667.23 frames. ], batch size: 608, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 12:57:35,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.97 vs. 
limit=22.5 2023-06-22 12:59:45,220 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 2.718e+02 3.272e+02 3.962e+02 1.016e+03, threshold=6.543e+02, percent-clipped=3.0 2023-06-22 12:59:52,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0 2023-06-22 13:00:20,318 INFO [train.py:996] (0/4) Epoch 6, batch 8150, loss[loss=0.2379, simple_loss=0.3478, pruned_loss=0.06403, over 21779.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3187, pruned_loss=0.08097, over 4261007.69 frames. ], batch size: 352, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 13:00:21,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=963738.0, ans=0.2 2023-06-22 13:02:13,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=963978.0, ans=0.2 2023-06-22 13:02:23,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=963978.0, ans=0.125 2023-06-22 13:02:29,443 INFO [train.py:996] (0/4) Epoch 6, batch 8200, loss[loss=0.2632, simple_loss=0.3053, pruned_loss=0.1106, over 21415.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3103, pruned_loss=0.07911, over 4256106.80 frames. ], batch size: 509, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 13:03:52,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=964218.0, ans=0.125 2023-06-22 13:04:16,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.425e+02 2.871e+02 3.494e+02 6.098e+02, threshold=5.742e+02, percent-clipped=0.0 2023-06-22 13:04:33,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=964278.0, ans=0.125 2023-06-22 13:04:59,538 INFO [train.py:996] (0/4) Epoch 6, batch 8250, loss[loss=0.2132, simple_loss=0.2971, pruned_loss=0.06463, over 21431.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3068, pruned_loss=0.07888, over 4251519.06 frames. ], batch size: 131, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 13:05:10,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=964338.0, ans=0.125 2023-06-22 13:05:57,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=964458.0, ans=0.04949747468305833 2023-06-22 13:05:58,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=964458.0, ans=0.125 2023-06-22 13:05:59,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-22 13:06:01,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=964458.0, ans=0.1 2023-06-22 13:07:14,452 INFO [train.py:996] (0/4) Epoch 6, batch 8300, loss[loss=0.216, simple_loss=0.2911, pruned_loss=0.07044, over 21245.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3048, pruned_loss=0.07543, over 4251054.89 frames. 
], batch size: 176, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 13:08:35,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=964818.0, ans=0.025 2023-06-22 13:08:40,408 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.450e+02 2.816e+02 3.478e+02 6.310e+02, threshold=5.632e+02, percent-clipped=2.0 2023-06-22 13:09:32,568 INFO [train.py:996] (0/4) Epoch 6, batch 8350, loss[loss=0.2172, simple_loss=0.29, pruned_loss=0.07226, over 21784.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.304, pruned_loss=0.07307, over 4250627.78 frames. ], batch size: 372, lr: 5.21e-03, grad_scale: 16.0 2023-06-22 13:10:11,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=964998.0, ans=0.0 2023-06-22 13:11:02,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=965178.0, ans=0.0 2023-06-22 13:11:27,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=965178.0, ans=0.125 2023-06-22 13:11:45,370 INFO [train.py:996] (0/4) Epoch 6, batch 8400, loss[loss=0.2217, simple_loss=0.2662, pruned_loss=0.08859, over 20004.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3005, pruned_loss=0.07091, over 4241487.16 frames. ], batch size: 703, lr: 5.21e-03, grad_scale: 32.0 2023-06-22 13:13:14,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.764e+02 2.326e+02 2.586e+02 3.002e+02 4.637e+02, threshold=5.171e+02, percent-clipped=0.0 2023-06-22 13:13:41,589 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.38 vs. limit=22.5 2023-06-22 13:13:50,942 INFO [train.py:996] (0/4) Epoch 6, batch 8450, loss[loss=0.2441, simple_loss=0.3096, pruned_loss=0.08926, over 21235.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2991, pruned_loss=0.07128, over 4254255.74 frames. ], batch size: 143, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:13:57,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=965538.0, ans=0.0 2023-06-22 13:14:27,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=22.5 2023-06-22 13:14:55,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=965658.0, ans=0.0 2023-06-22 13:15:24,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=965718.0, ans=10.0 2023-06-22 13:15:24,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=965718.0, ans=0.125 2023-06-22 13:15:27,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=965778.0, ans=0.125 2023-06-22 13:15:40,027 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-22 13:16:01,038 INFO [train.py:996] (0/4) Epoch 6, batch 8500, loss[loss=0.2189, simple_loss=0.2744, pruned_loss=0.0817, over 21209.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2954, pruned_loss=0.07325, over 4253785.39 frames. 
], batch size: 159, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:16:39,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-06-22 13:17:28,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=966018.0, ans=0.0 2023-06-22 13:17:39,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=966018.0, ans=0.2 2023-06-22 13:17:48,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.784e+02 3.161e+02 3.760e+02 5.772e+02, threshold=6.322e+02, percent-clipped=2.0 2023-06-22 13:17:56,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=966078.0, ans=0.125 2023-06-22 13:18:27,235 INFO [train.py:996] (0/4) Epoch 6, batch 8550, loss[loss=0.2272, simple_loss=0.3158, pruned_loss=0.0693, over 21616.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2995, pruned_loss=0.07542, over 4256289.31 frames. ], batch size: 263, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:18:50,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=966138.0, ans=0.2 2023-06-22 13:19:05,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=966198.0, ans=0.05 2023-06-22 13:21:01,121 INFO [train.py:996] (0/4) Epoch 6, batch 8600, loss[loss=0.1881, simple_loss=0.2296, pruned_loss=0.07332, over 20018.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3082, pruned_loss=0.07814, over 4263817.97 frames. ], batch size: 704, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:21:05,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-22 13:21:06,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=966438.0, ans=0.125 2023-06-22 13:21:27,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=966498.0, ans=0.0 2023-06-22 13:22:18,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=966558.0, ans=0.125 2023-06-22 13:22:42,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=966618.0, ans=0.0 2023-06-22 13:22:45,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=966618.0, ans=0.0 2023-06-22 13:22:56,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.812e+02 3.242e+02 4.124e+02 6.124e+02, threshold=6.484e+02, percent-clipped=0.0 2023-06-22 13:23:27,232 INFO [train.py:996] (0/4) Epoch 6, batch 8650, loss[loss=0.2258, simple_loss=0.3016, pruned_loss=0.07502, over 20812.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3113, pruned_loss=0.07928, over 4258792.01 frames. 
], batch size: 607, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:23:40,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=966738.0, ans=0.125 2023-06-22 13:23:42,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=966738.0, ans=0.125 2023-06-22 13:25:21,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=966978.0, ans=0.0 2023-06-22 13:25:25,290 INFO [train.py:996] (0/4) Epoch 6, batch 8700, loss[loss=0.2137, simple_loss=0.2818, pruned_loss=0.07284, over 21793.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3028, pruned_loss=0.07489, over 4253489.55 frames. ], batch size: 98, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:25:31,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=967038.0, ans=0.1 2023-06-22 13:26:46,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=967218.0, ans=0.125 2023-06-22 13:26:56,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 2.253e+02 2.607e+02 2.953e+02 4.706e+02, threshold=5.214e+02, percent-clipped=0.0 2023-06-22 13:27:04,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2023-06-22 13:27:24,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=967278.0, ans=0.1 2023-06-22 13:27:27,310 INFO [train.py:996] (0/4) Epoch 6, batch 8750, loss[loss=0.2151, simple_loss=0.2819, pruned_loss=0.07417, over 21472.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2992, pruned_loss=0.07478, over 4256975.05 frames. ], batch size: 194, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:28:00,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=967398.0, ans=0.1 2023-06-22 13:28:27,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=967398.0, ans=0.1 2023-06-22 13:28:59,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=967518.0, ans=0.1 2023-06-22 13:29:45,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=967578.0, ans=0.1 2023-06-22 13:29:59,543 INFO [train.py:996] (0/4) Epoch 6, batch 8800, loss[loss=0.2875, simple_loss=0.3681, pruned_loss=0.1034, over 21847.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3086, pruned_loss=0.07803, over 4260023.92 frames. 
], batch size: 118, lr: 5.20e-03, grad_scale: 32.0 2023-06-22 13:30:00,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=967638.0, ans=0.0 2023-06-22 13:31:46,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 2.702e+02 3.094e+02 3.585e+02 5.689e+02, threshold=6.187e+02, percent-clipped=2.0 2023-06-22 13:31:48,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=967878.0, ans=0.125 2023-06-22 13:32:17,677 INFO [train.py:996] (0/4) Epoch 6, batch 8850, loss[loss=0.2764, simple_loss=0.3565, pruned_loss=0.09818, over 21388.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.317, pruned_loss=0.08018, over 4256028.23 frames. ], batch size: 131, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:32:21,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-22 13:32:23,409 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.76 vs. limit=6.0 2023-06-22 13:33:03,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=968058.0, ans=0.0 2023-06-22 13:33:49,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=968118.0, ans=0.0 2023-06-22 13:33:59,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=968118.0, ans=0.0 2023-06-22 13:34:32,293 INFO [train.py:996] (0/4) Epoch 6, batch 8900, loss[loss=0.2299, simple_loss=0.299, pruned_loss=0.08039, over 21795.00 frames. ], tot_loss[loss=0.234, simple_loss=0.311, pruned_loss=0.0785, over 4250766.06 frames. ], batch size: 102, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:35:24,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=968298.0, ans=0.125 2023-06-22 13:36:25,711 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 2.633e+02 3.165e+02 3.746e+02 7.673e+02, threshold=6.331e+02, percent-clipped=6.0 2023-06-22 13:36:50,914 INFO [train.py:996] (0/4) Epoch 6, batch 8950, loss[loss=0.2016, simple_loss=0.2613, pruned_loss=0.07092, over 21212.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3105, pruned_loss=0.07824, over 4255764.80 frames. ], batch size: 176, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:37:13,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=968598.0, ans=0.125 2023-06-22 13:38:49,269 INFO [train.py:996] (0/4) Epoch 6, batch 9000, loss[loss=0.2027, simple_loss=0.2625, pruned_loss=0.07146, over 21815.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3036, pruned_loss=0.07783, over 4261896.27 frames. ], batch size: 124, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:38:49,271 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 13:39:41,075 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2635, simple_loss=0.3541, pruned_loss=0.08643, over 1796401.00 frames. 
2023-06-22 13:39:41,077 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23877MB 2023-06-22 13:40:26,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-22 13:40:57,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=969018.0, ans=0.0 2023-06-22 13:40:59,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=969018.0, ans=0.125 2023-06-22 13:41:04,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.727e+02 3.183e+02 3.602e+02 6.441e+02, threshold=6.367e+02, percent-clipped=1.0 2023-06-22 13:41:36,434 INFO [train.py:996] (0/4) Epoch 6, batch 9050, loss[loss=0.2615, simple_loss=0.3388, pruned_loss=0.09206, over 21754.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3, pruned_loss=0.07506, over 4261206.90 frames. ], batch size: 124, lr: 5.20e-03, grad_scale: 16.0 2023-06-22 13:42:03,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=969138.0, ans=0.125 2023-06-22 13:42:18,193 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.98 vs. limit=10.0 2023-06-22 13:42:57,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=969258.0, ans=0.2 2023-06-22 13:43:05,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=969318.0, ans=0.0 2023-06-22 13:43:38,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=969378.0, ans=0.1 2023-06-22 13:43:56,168 INFO [train.py:996] (0/4) Epoch 6, batch 9100, loss[loss=0.2022, simple_loss=0.2957, pruned_loss=0.05439, over 21309.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3072, pruned_loss=0.07788, over 4257433.79 frames. ], batch size: 176, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 13:44:53,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=969498.0, ans=0.0 2023-06-22 13:45:05,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.76 vs. limit=15.0 2023-06-22 13:45:17,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=969558.0, ans=0.2 2023-06-22 13:45:51,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 2.427e+02 2.866e+02 3.272e+02 6.065e+02, threshold=5.732e+02, percent-clipped=0.0 2023-06-22 13:46:21,659 INFO [train.py:996] (0/4) Epoch 6, batch 9150, loss[loss=0.226, simple_loss=0.3082, pruned_loss=0.07194, over 21450.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3081, pruned_loss=0.07441, over 4269365.16 frames. ], batch size: 211, lr: 5.19e-03, grad_scale: 16.0 2023-06-22 13:47:32,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=969858.0, ans=0.125 2023-06-22 13:48:11,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. 
2023-06-22 13:48:31,183 INFO [train.py:996] (0/4) Epoch 6, batch 9200, loss[loss=0.2706, simple_loss=0.3413, pruned_loss=0.09993, over 21814.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3107, pruned_loss=0.07411, over 4277999.04 frames. ], batch size: 124, lr: 5.19e-03, grad_scale: 32.0
2023-06-22 13:49:44,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=970158.0, ans=0.0
2023-06-22 13:49:49,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=970218.0, ans=0.05
2023-06-22 13:50:20,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.602e+02 3.064e+02 3.617e+02 6.755e+02, threshold=6.128e+02, percent-clipped=8.0
2023-06-22 13:50:30,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=970278.0, ans=0.1
2023-06-22 13:50:39,027 INFO [train.py:996] (0/4) Epoch 6, batch 9250, loss[loss=0.2313, simple_loss=0.2965, pruned_loss=0.0831, over 21451.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.314, pruned_loss=0.07744, over 4280502.27 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 32.0
2023-06-22 13:51:39,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=970458.0, ans=0.125
2023-06-22 13:51:39,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0
2023-06-22 13:52:49,820 INFO [train.py:996] (0/4) Epoch 6, batch 9300, loss[loss=0.1969, simple_loss=0.2681, pruned_loss=0.06289, over 21767.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3077, pruned_loss=0.07668, over 4275262.23 frames. ], batch size: 124, lr: 5.19e-03, grad_scale: 32.0
2023-06-22 13:53:11,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=970638.0, ans=0.0
2023-06-22 13:53:18,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0
2023-06-22 13:54:15,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0
2023-06-22 13:54:44,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.639e+02 3.001e+02 3.479e+02 6.527e+02, threshold=6.003e+02, percent-clipped=1.0
2023-06-22 13:55:04,080 INFO [train.py:996] (0/4) Epoch 6, batch 9350, loss[loss=0.2613, simple_loss=0.344, pruned_loss=0.08928, over 21805.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3132, pruned_loss=0.07774, over 4276291.57 frames. ], batch size: 118, lr: 5.19e-03, grad_scale: 16.0
2023-06-22 13:55:28,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=970938.0, ans=0.125
2023-06-22 13:56:37,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=971058.0, ans=0.0
2023-06-22 13:57:41,563 INFO [train.py:996] (0/4) Epoch 6, batch 9400, loss[loss=0.2354, simple_loss=0.2954, pruned_loss=0.0877, over 21532.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3157, pruned_loss=0.07811, over 4272052.40 frames. ], batch size: 441, lr: 5.19e-03, grad_scale: 16.0
2023-06-22 13:58:23,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=971358.0, ans=0.0
2023-06-22 13:59:15,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.533e+02 2.854e+02 3.549e+02 7.944e+02, threshold=5.708e+02, percent-clipped=6.0
2023-06-22 13:59:45,780 INFO [train.py:996] (0/4) Epoch 6, batch 9450, loss[loss=0.2021, simple_loss=0.2642, pruned_loss=0.06996, over 21550.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3075, pruned_loss=0.07683, over 4267258.06 frames. ], batch size: 195, lr: 5.19e-03, grad_scale: 16.0
2023-06-22 13:59:47,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=971538.0, ans=0.025
2023-06-22 14:00:00,886 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0
2023-06-22 14:02:06,236 INFO [train.py:996] (0/4) Epoch 6, batch 9500, loss[loss=0.1866, simple_loss=0.2745, pruned_loss=0.04934, over 21707.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2996, pruned_loss=0.07477, over 4264967.69 frames. ], batch size: 332, lr: 5.19e-03, grad_scale: 16.0
2023-06-22 14:03:44,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=972018.0, ans=0.125
2023-06-22 14:03:52,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.465e+02 2.826e+02 3.384e+02 5.228e+02, threshold=5.653e+02, percent-clipped=0.0
2023-06-22 14:04:25,789 INFO [train.py:996] (0/4) Epoch 6, batch 9550, loss[loss=0.2244, simple_loss=0.312, pruned_loss=0.06838, over 16367.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3045, pruned_loss=0.07724, over 4261171.29 frames. ], batch size: 60, lr: 5.19e-03, grad_scale: 16.0
2023-06-22 14:05:21,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0
2023-06-22 14:05:32,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0
2023-06-22 14:05:53,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=972318.0, ans=0.1
2023-06-22 14:06:44,198 INFO [train.py:996] (0/4) Epoch 6, batch 9600, loss[loss=0.2177, simple_loss=0.2961, pruned_loss=0.06968, over 21855.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3077, pruned_loss=0.07798, over 4268799.82 frames. ], batch size: 351, lr: 5.19e-03, grad_scale: 32.0
2023-06-22 14:06:49,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=972438.0, ans=0.125
2023-06-22 14:07:31,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=972498.0, ans=0.125
2023-06-22 14:08:31,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.618e+02 2.907e+02 3.359e+02 5.518e+02, threshold=5.814e+02, percent-clipped=0.0
2023-06-22 14:09:09,933 INFO [train.py:996] (0/4) Epoch 6, batch 9650, loss[loss=0.2518, simple_loss=0.3251, pruned_loss=0.08921, over 21631.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3085, pruned_loss=0.07829, over 4267976.86 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 32.0
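The optim.py:471 lines report five points of the recent gradient-norm distribution (min, 25%, median, 75%, max), a clipping threshold, and the percentage of recent batches clipped at that threshold. One plausible reading, sketched below with illustrative names, is a clipper that keeps a window of recent norms and clips at Clipping_scale times the running median:

```python
from collections import deque
import torch

class QuartileClipper:
    """Clip at clipping_scale x median of recent grad norms (an assumption,
    not the verified icefall optim.py logic)."""
    def __init__(self, clipping_scale: float = 2.0, history: int = 128):
        self.scale = clipping_scale
        self.norms = deque(maxlen=history)
        self.clipped = 0
        self.total = 0

    def __call__(self, params) -> float:
        grads = [p.grad.flatten() for p in params if p.grad is not None]
        norm = torch.linalg.vector_norm(torch.cat(grads)).item()
        self.norms.append(norm)
        hist = sorted(self.norms)
        quartiles = [hist[int(f * (len(hist) - 1))] for f in (0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.scale * quartiles[2]  # 2.0 x median, per Clipping_scale=2.0
        self.total += 1
        if norm > threshold:
            self.clipped += 1
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(threshold / norm)
        return threshold
```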
2023-06-22 14:11:14,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=972978.0, ans=10.0
2023-06-22 14:11:27,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=972978.0, ans=0.0
2023-06-22 14:11:29,912 INFO [train.py:996] (0/4) Epoch 6, batch 9700, loss[loss=0.2407, simple_loss=0.3032, pruned_loss=0.08907, over 21831.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3096, pruned_loss=0.07814, over 4253848.00 frames. ], batch size: 112, lr: 5.18e-03, grad_scale: 16.0
2023-06-22 14:11:56,411 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=22.5
2023-06-22 14:13:01,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.601e+02 2.900e+02 3.367e+02 7.337e+02, threshold=5.800e+02, percent-clipped=1.0
2023-06-22 14:13:41,735 INFO [train.py:996] (0/4) Epoch 6, batch 9750, loss[loss=0.2009, simple_loss=0.2665, pruned_loss=0.06764, over 21635.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3025, pruned_loss=0.07686, over 4256498.65 frames. ], batch size: 298, lr: 5.18e-03, grad_scale: 16.0
2023-06-22 14:15:02,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=973518.0, ans=0.125
2023-06-22 14:15:38,398 INFO [train.py:996] (0/4) Epoch 6, batch 9800, loss[loss=0.2552, simple_loss=0.3149, pruned_loss=0.09773, over 21594.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3032, pruned_loss=0.07701, over 4253648.28 frames. ], batch size: 471, lr: 5.18e-03, grad_scale: 16.0
2023-06-22 14:15:56,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=973638.0, ans=0.0
2023-06-22 14:16:13,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=973698.0, ans=0.1
2023-06-22 14:16:29,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=973758.0, ans=0.0
2023-06-22 14:16:52,950 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 14:17:00,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973818.0, ans=0.1
2023-06-22 14:17:27,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=973878.0, ans=0.04949747468305833
2023-06-22 14:17:28,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.469e+02 2.952e+02 3.754e+02 9.468e+02, threshold=5.905e+02, percent-clipped=4.0
2023-06-22 14:17:30,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=973878.0, ans=0.125
2023-06-22 14:17:42,961 INFO [train.py:996] (0/4) Epoch 6, batch 9850, loss[loss=0.1869, simple_loss=0.2476, pruned_loss=0.06307, over 21224.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2994, pruned_loss=0.077, over 4262022.41 frames. ], batch size: 176, lr: 5.18e-03, grad_scale: 16.0
2023-06-22 14:18:20,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=973998.0, ans=0.125
2023-06-22 14:18:31,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=974058.0, ans=0.125
2023-06-22 14:18:47,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=974058.0, ans=0.2
2023-06-22 14:19:44,034 INFO [train.py:996] (0/4) Epoch 6, batch 9900, loss[loss=0.2641, simple_loss=0.3252, pruned_loss=0.1015, over 21576.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2958, pruned_loss=0.07664, over 4259411.64 frames. ], batch size: 414, lr: 5.18e-03, grad_scale: 16.0
2023-06-22 14:20:24,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=974298.0, ans=0.0
2023-06-22 14:20:25,137 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.03 vs. limit=10.0
2023-06-22 14:20:27,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=974298.0, ans=0.125
2023-06-22 14:20:28,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=974298.0, ans=0.0
2023-06-22 14:21:10,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=974418.0, ans=0.0
2023-06-22 14:21:19,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=974418.0, ans=0.05
2023-06-22 14:21:25,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=974478.0, ans=0.0
2023-06-22 14:21:36,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.526e+02 2.876e+02 3.339e+02 4.860e+02, threshold=5.753e+02, percent-clipped=0.0
2023-06-22 14:21:55,177 INFO [train.py:996] (0/4) Epoch 6, batch 9950, loss[loss=0.2682, simple_loss=0.3067, pruned_loss=0.1148, over 21404.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2984, pruned_loss=0.07908, over 4266490.41 frames. ], batch size: 510, lr: 5.18e-03, grad_scale: 16.0
2023-06-22 14:22:21,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=974538.0, ans=0.1
2023-06-22 14:22:36,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=974598.0, ans=0.125
2023-06-22 14:23:47,908 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0
2023-06-22 14:23:56,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=974778.0, ans=0.2
2023-06-22 14:24:22,817 INFO [train.py:996] (0/4) Epoch 6, batch 10000, loss[loss=0.2419, simple_loss=0.2971, pruned_loss=0.09331, over 21454.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2957, pruned_loss=0.07803, over 4253249.86 frames. ], batch size: 509, lr: 5.18e-03, grad_scale: 32.0
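The scaling.py:182 lines track module hyperparameters (dropout_p, skip rates, balancer probabilities, bypass scales) that are functions of the global batch count rather than constants; by batch ~974k most have settled at their final values (ans=0.0, 0.1, 0.125, ...). A piecewise-linear schedule keyed on batch count reproduces the flavor of such a ScheduledFloat; the breakpoints below are invented for illustration:

```python
def scheduled_float(batch_count: float, points) -> float:
    """points: [(batch, value), ...] sorted by batch; linear in between,
    clamped at both ends. A sketch of the idea, not the icefall class."""
    (x0, y0) = points[0]
    if batch_count <= x0:
        return y0
    for (x1, y1) in points[1:]:
        if batch_count <= x1:
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
        x0, y0 = x1, y1
    return y0

# e.g. a dropout that decays from 0.3 to 0.1 over the first 20k batches
# has long since flattened out at this point in training:
dropout_p = scheduled_float(974538.0, [(0.0, 0.3), (20000.0, 0.1)])  # -> 0.1
```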
2023-06-22 14:24:45,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=974898.0, ans=0.125
2023-06-22 14:25:15,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=974958.0, ans=0.125
2023-06-22 14:25:38,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=974958.0, ans=0.0
2023-06-22 14:25:55,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=975078.0, ans=0.035
2023-06-22 14:25:56,000 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 2.535e+02 2.869e+02 3.521e+02 5.167e+02, threshold=5.738e+02, percent-clipped=0.0
2023-06-22 14:26:00,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=975078.0, ans=0.0
2023-06-22 14:26:29,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=975138.0, ans=0.125
2023-06-22 14:26:30,276 INFO [train.py:996] (0/4) Epoch 6, batch 10050, loss[loss=0.237, simple_loss=0.3035, pruned_loss=0.08525, over 21420.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2992, pruned_loss=0.07935, over 4258735.89 frames. ], batch size: 131, lr: 5.18e-03, grad_scale: 32.0
2023-06-22 14:26:42,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0
2023-06-22 14:26:48,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.68 vs. limit=15.0
2023-06-22 14:27:24,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.88 vs. limit=15.0
2023-06-22 14:28:50,957 INFO [train.py:996] (0/4) Epoch 6, batch 10100, loss[loss=0.2682, simple_loss=0.3775, pruned_loss=0.07943, over 19853.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2964, pruned_loss=0.07651, over 4265397.10 frames. ], batch size: 702, lr: 5.18e-03, grad_scale: 32.0
2023-06-22 14:29:35,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0
2023-06-22 14:30:14,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=975618.0, ans=10.0
2023-06-22 14:30:37,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=975618.0, ans=0.0
2023-06-22 14:30:41,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.495e+02 2.944e+02 3.852e+02 6.344e+02, threshold=5.889e+02, percent-clipped=1.0
2023-06-22 14:30:46,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.67 vs. limit=5.0
2023-06-22 14:30:48,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=975678.0, ans=0.125
2023-06-22 14:30:56,709 INFO [train.py:996] (0/4) Epoch 6, batch 10150, loss[loss=0.2481, simple_loss=0.3263, pruned_loss=0.0849, over 21885.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3017, pruned_loss=0.07851, over 4265006.73 frames. ], batch size: 371, lr: 5.18e-03, grad_scale: 32.0
2023-06-22 14:31:57,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=975858.0, ans=0.2
2023-06-22 14:31:57,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=975858.0, ans=0.125
2023-06-22 14:32:11,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0
2023-06-22 14:33:07,626 INFO [train.py:996] (0/4) Epoch 6, batch 10200, loss[loss=0.1965, simple_loss=0.278, pruned_loss=0.05745, over 21605.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3014, pruned_loss=0.07637, over 4269545.90 frames. ], batch size: 263, lr: 5.18e-03, grad_scale: 32.0
2023-06-22 14:33:10,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5
2023-06-22 14:35:04,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 2.209e+02 2.586e+02 3.021e+02 4.237e+02, threshold=5.173e+02, percent-clipped=0.0
2023-06-22 14:35:19,327 INFO [train.py:996] (0/4) Epoch 6, batch 10250, loss[loss=0.1827, simple_loss=0.2794, pruned_loss=0.04301, over 21791.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2943, pruned_loss=0.07009, over 4269912.13 frames. ], batch size: 333, lr: 5.18e-03, grad_scale: 32.0
2023-06-22 14:36:17,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=976398.0, ans=0.125
2023-06-22 14:36:31,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=976458.0, ans=0.1
2023-06-22 14:36:42,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=976458.0, ans=0.125
2023-06-22 14:37:03,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=976518.0, ans=0.0
2023-06-22 14:37:49,228 INFO [train.py:996] (0/4) Epoch 6, batch 10300, loss[loss=0.1671, simple_loss=0.2552, pruned_loss=0.03947, over 21864.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2963, pruned_loss=0.07101, over 4275502.68 frames. ], batch size: 107, lr: 5.18e-03, grad_scale: 32.0
2023-06-22 14:38:32,742 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.48 vs. limit=10.0
2023-06-22 14:38:33,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=976698.0, ans=0.125
2023-06-22 14:39:09,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=976818.0, ans=0.0
2023-06-22 14:39:24,000 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 2.460e+02 2.831e+02 3.476e+02 5.397e+02, threshold=5.661e+02, percent-clipped=1.0
2023-06-22 14:40:01,748 INFO [train.py:996] (0/4) Epoch 6, batch 10350, loss[loss=0.1868, simple_loss=0.2572, pruned_loss=0.05817, over 21664.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2991, pruned_loss=0.07244, over 4276944.30 frames. ], batch size: 247, lr: 5.17e-03, grad_scale: 16.0
2023-06-22 14:41:11,627 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0
2023-06-22 14:41:14,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=977058.0, ans=0.0
2023-06-22 14:41:20,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=977058.0, ans=0.125
2023-06-22 14:42:16,431 INFO [train.py:996] (0/4) Epoch 6, batch 10400, loss[loss=0.2914, simple_loss=0.3487, pruned_loss=0.1171, over 21529.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2927, pruned_loss=0.07063, over 4261347.89 frames. ], batch size: 509, lr: 5.17e-03, grad_scale: 32.0
2023-06-22 14:42:31,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=977238.0, ans=0.125
2023-06-22 14:42:31,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=977238.0, ans=0.0
2023-06-22 14:42:33,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=977238.0, ans=0.2
2023-06-22 14:43:24,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=977358.0, ans=0.07
2023-06-22 14:44:05,936 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.727e+02 3.248e+02 3.919e+02 5.926e+02, threshold=6.497e+02, percent-clipped=3.0
2023-06-22 14:44:12,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=977478.0, ans=0.125
2023-06-22 14:44:33,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=977478.0, ans=0.1
2023-06-22 14:44:37,182 INFO [train.py:996] (0/4) Epoch 6, batch 10450, loss[loss=0.2386, simple_loss=0.3129, pruned_loss=0.08218, over 20660.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2975, pruned_loss=0.07424, over 4271452.83 frames. ], batch size: 607, lr: 5.17e-03, grad_scale: 32.0
2023-06-22 14:44:37,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=977538.0, ans=0.1
2023-06-22 14:45:15,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=977598.0, ans=0.5
2023-06-22 14:45:44,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=977658.0, ans=0.04949747468305833
2023-06-22 14:46:22,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=977718.0, ans=0.0
2023-06-22 14:46:38,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=977778.0, ans=0.125
2023-06-22 14:46:55,147 INFO [train.py:996] (0/4) Epoch 6, batch 10500, loss[loss=0.2296, simple_loss=0.2982, pruned_loss=0.08051, over 15760.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2967, pruned_loss=0.07309, over 4270063.16 frames. ], batch size: 60, lr: 5.17e-03, grad_scale: 32.0
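grad_scale in the train.py:996 lines moves between 8.0 and 32.0 over this stretch of the log: the dynamic loss scale of fp16 training, which is halved when a scaled gradient overflows and grown back after a run of clean steps. The standard torch.cuda.amp machinery behaves this way; the values passed below are illustrative, and compute_loss() is a stand-in for the recipe's loss computation:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=32.0, growth_interval=2000)

def train_step(model, optimizer, batch):
    optimizer.zero_grad()
    with autocast():
        loss = compute_loss(model, batch)  # stand-in for the recipe's loss
    scaler.scale(loss).backward()          # backward through the scaled loss
    scaler.step(optimizer)                 # step is skipped on inf/nan gradients
    scaler.update()                        # halve on overflow, grow otherwise
    return loss.detach(), scaler.get_scale()
```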
2023-06-22 14:48:04,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=977958.0, ans=0.125
2023-06-22 14:48:07,467 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 14:48:31,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.295e+02 2.560e+02 3.007e+02 4.435e+02, threshold=5.120e+02, percent-clipped=0.0
2023-06-22 14:49:11,526 INFO [train.py:996] (0/4) Epoch 6, batch 10550, loss[loss=0.1856, simple_loss=0.2478, pruned_loss=0.06172, over 21652.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2905, pruned_loss=0.07265, over 4266983.03 frames. ], batch size: 264, lr: 5.17e-03, grad_scale: 32.0
2023-06-22 14:49:26,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=978138.0, ans=0.0
2023-06-22 14:49:35,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=22.5
2023-06-22 14:51:24,202 INFO [train.py:996] (0/4) Epoch 6, batch 10600, loss[loss=0.1842, simple_loss=0.2616, pruned_loss=0.05342, over 21256.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2876, pruned_loss=0.07138, over 4270809.99 frames. ], batch size: 176, lr: 5.17e-03, grad_scale: 32.0
2023-06-22 14:51:27,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0
2023-06-22 14:51:45,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=978498.0, ans=0.0
2023-06-22 14:52:00,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=978558.0, ans=0.1
2023-06-22 14:52:02,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=978558.0, ans=0.0
2023-06-22 14:52:20,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=978558.0, ans=0.0
2023-06-22 14:52:50,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=978618.0, ans=0.125
2023-06-22 14:53:21,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.439e+02 2.926e+02 3.580e+02 7.545e+02, threshold=5.851e+02, percent-clipped=4.0
2023-06-22 14:53:38,921 INFO [train.py:996] (0/4) Epoch 6, batch 10650, loss[loss=0.161, simple_loss=0.2451, pruned_loss=0.03849, over 21612.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2889, pruned_loss=0.06952, over 4263993.47 frames. ], batch size: 247, lr: 5.17e-03, grad_scale: 16.0
2023-06-22 14:54:23,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5
2023-06-22 14:55:40,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=978978.0, ans=0.0
2023-06-22 14:55:49,529 INFO [train.py:996] (0/4) Epoch 6, batch 10700, loss[loss=0.2473, simple_loss=0.3191, pruned_loss=0.0877, over 21197.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2901, pruned_loss=0.07026, over 4252569.28 frames. ], batch size: 143, lr: 5.17e-03, grad_scale: 16.0
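Many ScheduledFloat entries above belong to balancers (balancer1.prob, min_positive, max_abs, min_abs, ...), modules that constrain per-channel activation statistics. A toy version of the idea, not the icefall implementation: if the fraction of positive activations in a channel leaves [min_positive, max_positive], bias that channel's gradient so optimization pushes it back into range, and only do so with probability prob:

```python
import random
import torch

def balance(x, grad, min_positive=0.05, max_positive=0.95,
            prob=0.125, penalty=0.01):
    """x, grad: (num_frames, num_channels). Returns a possibly-adjusted grad.
    A heavily simplified sketch of an activation balancer."""
    if random.random() > prob:          # applied stochastically, like the logs' 'prob'
        return grad
    frac_pos = (x > 0).float().mean(dim=0)
    push = (frac_pos < min_positive).float() - (frac_pos > max_positive).float()
    # Subtracting from the gradient raises the activation under gradient descent.
    return grad - penalty * push * grad.abs().mean(dim=0, keepdim=True)
```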
2023-06-22 14:57:02,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=979158.0, ans=0.1
2023-06-22 14:57:40,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=979218.0, ans=0.0
2023-06-22 14:57:59,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=979278.0, ans=0.0
2023-06-22 14:58:02,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.582e+02 2.877e+02 3.268e+02 5.588e+02, threshold=5.755e+02, percent-clipped=0.0
2023-06-22 14:58:03,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=979278.0, ans=0.125
2023-06-22 14:58:04,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=979278.0, ans=0.0
2023-06-22 14:58:17,284 INFO [train.py:996] (0/4) Epoch 6, batch 10750, loss[loss=0.2233, simple_loss=0.3021, pruned_loss=0.07223, over 21373.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3005, pruned_loss=0.07413, over 4258969.86 frames. ], batch size: 176, lr: 5.17e-03, grad_scale: 16.0
2023-06-22 14:58:47,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=979398.0, ans=0.0
2023-06-22 14:59:03,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=979398.0, ans=0.125
2023-06-22 15:00:03,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.23 vs. limit=22.5
2023-06-22 15:00:09,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0
2023-06-22 15:00:12,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=979518.0, ans=0.0
2023-06-22 15:00:35,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=979578.0, ans=0.0
2023-06-22 15:00:52,824 INFO [train.py:996] (0/4) Epoch 6, batch 10800, loss[loss=0.2516, simple_loss=0.3215, pruned_loss=0.09082, over 21353.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3047, pruned_loss=0.07487, over 4262237.87 frames. ], batch size: 176, lr: 5.17e-03, grad_scale: 32.0
2023-06-22 15:02:07,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0
2023-06-22 15:02:07,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=979818.0, ans=0.125
2023-06-22 15:02:16,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=979818.0, ans=0.125
2023-06-22 15:02:44,268 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.781e+02 3.238e+02 4.047e+02 6.056e+02, threshold=6.476e+02, percent-clipped=1.0
2023-06-22 15:03:16,982 INFO [train.py:996] (0/4) Epoch 6, batch 10850, loss[loss=0.1986, simple_loss=0.2768, pruned_loss=0.06017, over 21788.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3054, pruned_loss=0.07509, over 4260086.58 frames. ], batch size: 102, lr: 5.17e-03, grad_scale: 32.0
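Each train.py:996 line pairs a single-batch loss ("loss[... over 21373.00 frames. ]") with a running aggregate ("tot_loss[... over 4258969.86 frames. ]"): losses are summed weighted by frame count and divided by total frames, so large batches count for more. A minimal version of that bookkeeping (ignoring any periodic resetting the recipe does):

```python
class FrameWeightedAverage:
    """Aggregate per-batch losses into 'tot_loss[...] over N frames' numbers."""
    def __init__(self):
        self.sums, self.frames = {}, 0.0

    def update(self, losses: dict, num_frames: float):
        for k, v in losses.items():
            self.sums[k] = self.sums.get(k, 0.0) + v * num_frames
        self.frames += num_frames

    def averages(self) -> dict:
        return {k: s / self.frames for k, s in self.sums.items()}

# e.g. folding in the batch 10750 numbers from the line above:
tot = FrameWeightedAverage()
tot.update({"loss": 0.2233, "simple_loss": 0.3021, "pruned_loss": 0.07223}, 21373.0)
```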
2023-06-22 15:03:31,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=979998.0, ans=0.125
2023-06-22 15:03:35,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=979998.0, ans=0.1
2023-06-22 15:05:01,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=980178.0, ans=0.0
2023-06-22 15:05:23,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0
2023-06-22 15:05:23,920 INFO [train.py:996] (0/4) Epoch 6, batch 10900, loss[loss=0.2099, simple_loss=0.2947, pruned_loss=0.06256, over 21445.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2992, pruned_loss=0.07334, over 4261494.41 frames. ], batch size: 194, lr: 5.17e-03, grad_scale: 32.0
2023-06-22 15:05:27,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=980238.0, ans=0.125
2023-06-22 15:06:13,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=980298.0, ans=0.125
2023-06-22 15:06:20,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=980298.0, ans=0.125
2023-06-22 15:06:30,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=980358.0, ans=0.0
2023-06-22 15:06:46,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0
2023-06-22 15:06:47,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=980418.0, ans=0.125
2023-06-22 15:07:09,955 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.391e+02 2.681e+02 3.118e+02 5.164e+02, threshold=5.361e+02, percent-clipped=0.0
2023-06-22 15:07:34,780 INFO [train.py:996] (0/4) Epoch 6, batch 10950, loss[loss=0.204, simple_loss=0.2709, pruned_loss=0.06855, over 21142.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2924, pruned_loss=0.07115, over 4258203.35 frames. ], batch size: 143, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:10:01,903 INFO [train.py:996] (0/4) Epoch 6, batch 11000, loss[loss=0.2342, simple_loss=0.301, pruned_loss=0.08373, over 21593.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2915, pruned_loss=0.07196, over 4258589.97 frames. ], batch size: 212, lr: 5.16e-03, grad_scale: 16.0
2023-06-22 15:10:09,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=980838.0, ans=0.125
2023-06-22 15:10:48,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=980958.0, ans=0.125
2023-06-22 15:10:51,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=980958.0, ans=0.0
2023-06-22 15:10:55,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=980958.0, ans=0.125
2023-06-22 15:10:58,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=980958.0, ans=0.125
2023-06-22 15:11:01,073 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:11:39,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.522e+02 2.854e+02 3.360e+02 5.643e+02, threshold=5.707e+02, percent-clipped=1.0
2023-06-22 15:11:55,624 INFO [train.py:996] (0/4) Epoch 6, batch 11050, loss[loss=0.2135, simple_loss=0.2792, pruned_loss=0.07385, over 21793.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2904, pruned_loss=0.07343, over 4271423.98 frames. ], batch size: 112, lr: 5.16e-03, grad_scale: 16.0
2023-06-22 15:12:14,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=981138.0, ans=0.125
2023-06-22 15:12:15,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=981138.0, ans=0.0
2023-06-22 15:12:20,308 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:12:53,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=981198.0, ans=0.0
2023-06-22 15:13:26,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=22.5
2023-06-22 15:14:05,803 INFO [train.py:996] (0/4) Epoch 6, batch 11100, loss[loss=0.2029, simple_loss=0.2803, pruned_loss=0.06275, over 21412.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.289, pruned_loss=0.07343, over 4276438.21 frames. ], batch size: 211, lr: 5.16e-03, grad_scale: 16.0
2023-06-22 15:14:24,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0
2023-06-22 15:15:29,798 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:15:30,388 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=22.5
2023-06-22 15:15:31,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=981618.0, ans=0.125
2023-06-22 15:15:34,755 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0
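The logged lr drifts from 5.20e-03 down to 5.14e-03 across this section, i.e. it decays smoothly with training progress rather than in steps. Zipformer recipes use an Eden-style schedule with this shape; the sketch below gives the general form, with the default constants and the exact parameterization treated as assumptions rather than read from this run:

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    """Smooth decay in both batch count and (fractional) epoch; the precise
    functional form in the recipe may differ in detail."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor
```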
2023-06-22 15:15:46,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=12.0
2023-06-22 15:15:51,520 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.464e+02 2.836e+02 3.391e+02 6.300e+02, threshold=5.672e+02, percent-clipped=1.0
2023-06-22 15:15:59,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=981678.0, ans=0.2
2023-06-22 15:16:19,349 INFO [train.py:996] (0/4) Epoch 6, batch 11150, loss[loss=0.1589, simple_loss=0.2247, pruned_loss=0.04655, over 16037.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.287, pruned_loss=0.07294, over 4244703.29 frames. ], batch size: 61, lr: 5.16e-03, grad_scale: 16.0
2023-06-22 15:17:32,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=981858.0, ans=0.0
2023-06-22 15:17:40,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=981858.0, ans=0.0
2023-06-22 15:18:19,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=981978.0, ans=0.0
2023-06-22 15:18:36,073 INFO [train.py:996] (0/4) Epoch 6, batch 11200, loss[loss=0.218, simple_loss=0.2876, pruned_loss=0.07421, over 21746.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2868, pruned_loss=0.07242, over 4254826.75 frames. ], batch size: 351, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:18:48,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.09 vs. limit=10.0
2023-06-22 15:18:49,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=982038.0, ans=0.1
2023-06-22 15:19:34,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=982098.0, ans=0.125
2023-06-22 15:19:49,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.39 vs. limit=22.5
2023-06-22 15:19:55,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=15.0
2023-06-22 15:20:20,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=982278.0, ans=0.025
2023-06-22 15:20:21,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.424e+02 2.651e+02 3.050e+02 4.953e+02, threshold=5.302e+02, percent-clipped=0.0
2023-06-22 15:20:44,983 INFO [train.py:996] (0/4) Epoch 6, batch 11250, loss[loss=0.2507, simple_loss=0.3099, pruned_loss=0.09571, over 21571.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2867, pruned_loss=0.07236, over 4252695.37 frames. ], batch size: 508, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:21:14,242 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:22:04,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0
2023-06-22 15:22:53,994 INFO [train.py:996] (0/4) Epoch 6, batch 11300, loss[loss=0.1686, simple_loss=0.2451, pruned_loss=0.04605, over 17002.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2887, pruned_loss=0.07291, over 4262260.66 frames. ], batch size: 63, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:24:23,067 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:24:48,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.408e+02 2.703e+02 3.080e+02 4.144e+02, threshold=5.407e+02, percent-clipped=0.0
2023-06-22 15:25:19,005 INFO [train.py:996] (0/4) Epoch 6, batch 11350, loss[loss=0.2294, simple_loss=0.3058, pruned_loss=0.07653, over 21246.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2889, pruned_loss=0.07204, over 4267557.26 frames. ], batch size: 143, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:25:32,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0
2023-06-22 15:25:43,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5
2023-06-22 15:27:42,753 INFO [train.py:996] (0/4) Epoch 6, batch 11400, loss[loss=0.1848, simple_loss=0.2506, pruned_loss=0.05954, over 16128.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2955, pruned_loss=0.07527, over 4266966.23 frames. ], batch size: 60, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:28:25,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. limit=10.0
2023-06-22 15:28:26,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=983298.0, ans=0.0
2023-06-22 15:28:41,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=983358.0, ans=0.1
2023-06-22 15:29:09,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=983418.0, ans=0.0
2023-06-22 15:29:37,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.455e+02 2.814e+02 3.249e+02 4.711e+02, threshold=5.629e+02, percent-clipped=0.0
2023-06-22 15:29:54,451 INFO [train.py:996] (0/4) Epoch 6, batch 11450, loss[loss=0.2448, simple_loss=0.3397, pruned_loss=0.07492, over 21731.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2981, pruned_loss=0.07481, over 4272534.14 frames. ], batch size: 415, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:30:18,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=983538.0, ans=0.0
2023-06-22 15:32:10,120 INFO [train.py:996] (0/4) Epoch 6, batch 11500, loss[loss=0.2371, simple_loss=0.3197, pruned_loss=0.07721, over 21615.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3007, pruned_loss=0.07615, over 4275682.62 frames. ], batch size: 414, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:32:36,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=983898.0, ans=0.0
2023-06-22 15:33:30,396 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-164000.pt
2023-06-22 15:33:36,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=983958.0, ans=0.0
2023-06-22 15:33:43,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5
2023-06-22 15:34:04,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.591e+02 3.004e+02 3.598e+02 5.267e+02, threshold=6.007e+02, percent-clipped=0.0
2023-06-22 15:34:38,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=984138.0, ans=0.125
2023-06-22 15:34:39,604 INFO [train.py:996] (0/4) Epoch 6, batch 11550, loss[loss=0.2396, simple_loss=0.3192, pruned_loss=0.08001, over 21286.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3049, pruned_loss=0.07508, over 4280035.30 frames. ], batch size: 176, lr: 5.16e-03, grad_scale: 32.0
2023-06-22 15:36:25,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=984318.0, ans=0.125
2023-06-22 15:36:48,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=984318.0, ans=0.125
2023-06-22 15:37:12,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=984438.0, ans=0.2
2023-06-22 15:37:12,901 INFO [train.py:996] (0/4) Epoch 6, batch 11600, loss[loss=0.224, simple_loss=0.3163, pruned_loss=0.0659, over 21530.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3203, pruned_loss=0.07815, over 4276392.18 frames. ], batch size: 131, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 15:37:46,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=984498.0, ans=0.0
2023-06-22 15:38:40,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=984618.0, ans=0.125
2023-06-22 15:39:22,406 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.936e+02 3.532e+02 4.287e+02 8.204e+02, threshold=7.063e+02, percent-clipped=1.0
2023-06-22 15:39:33,494 INFO [train.py:996] (0/4) Epoch 6, batch 11650, loss[loss=0.2738, simple_loss=0.3725, pruned_loss=0.08757, over 21886.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3245, pruned_loss=0.07812, over 4274363.62 frames. ], batch size: 317, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 15:39:41,835 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5
2023-06-22 15:41:18,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=984918.0, ans=0.2
2023-06-22 15:41:27,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=984918.0, ans=0.2
2023-06-22 15:41:38,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.72 vs. limit=22.5
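The checkpoint.py:75 line above shows a batch-indexed checkpoint (checkpoint-164000.pt) written mid-epoch into the experiment directory, alongside the usual epoch-end checkpoints. A sketch of that pattern; the every-4000-batches interval is an assumption:

```python
import torch
from pathlib import Path

def maybe_save_checkpoint(model, optimizer, batch_idx: int,
                          exp_dir: Path = Path("zipformer/exp_L_small"),
                          save_every_n: int = 4000):
    """Periodically write a batch-indexed checkpoint, e.g. checkpoint-164000.pt."""
    if batch_idx == 0 or batch_idx % save_every_n != 0:
        return
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx,
        },
        exp_dir / f"checkpoint-{batch_idx}.pt",
    )
```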
2023-06-22 15:41:46,243 INFO [train.py:996] (0/4) Epoch 6, batch 11700, loss[loss=0.195, simple_loss=0.2593, pruned_loss=0.06529, over 21589.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3162, pruned_loss=0.07805, over 4270859.45 frames. ], batch size: 263, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 15:42:17,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=985098.0, ans=0.125
2023-06-22 15:42:33,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5
2023-06-22 15:42:56,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0
2023-06-22 15:42:56,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=22.5
2023-06-22 15:43:45,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.631e+02 2.879e+02 3.567e+02 8.503e+02, threshold=5.757e+02, percent-clipped=1.0
2023-06-22 15:43:54,599 INFO [train.py:996] (0/4) Epoch 6, batch 11750, loss[loss=0.2212, simple_loss=0.2783, pruned_loss=0.08202, over 21297.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3074, pruned_loss=0.07739, over 4272548.04 frames. ], batch size: 144, lr: 5.15e-03, grad_scale: 16.0
2023-06-22 15:44:26,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=985338.0, ans=0.0
2023-06-22 15:46:36,567 INFO [train.py:996] (0/4) Epoch 6, batch 11800, loss[loss=0.22, simple_loss=0.3166, pruned_loss=0.06168, over 21722.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3078, pruned_loss=0.07872, over 4275578.01 frames. ], batch size: 298, lr: 5.15e-03, grad_scale: 16.0
2023-06-22 15:46:39,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.84 vs. limit=10.0
2023-06-22 15:46:56,707 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:47:10,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0
2023-06-22 15:47:56,619 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0
2023-06-22 15:48:13,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=985758.0, ans=0.125
2023-06-22 15:48:17,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=985818.0, ans=0.07
2023-06-22 15:48:34,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0
2023-06-22 15:48:37,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.436e+02 2.794e+02 3.118e+02 5.023e+02, threshold=5.587e+02, percent-clipped=0.0
2023-06-22 15:48:58,501 INFO [train.py:996] (0/4) Epoch 6, batch 11850, loss[loss=0.2198, simple_loss=0.3281, pruned_loss=0.05573, over 20773.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3103, pruned_loss=0.07841, over 4280906.08 frames. ], batch size: 608, lr: 5.15e-03, grad_scale: 16.0
2023-06-22 15:48:59,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=985938.0, ans=0.125
2023-06-22 15:50:17,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=986058.0, ans=0.015
2023-06-22 15:50:28,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=986058.0, ans=0.1
2023-06-22 15:50:49,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=986178.0, ans=0.125
2023-06-22 15:51:24,789 INFO [train.py:996] (0/4) Epoch 6, batch 11900, loss[loss=0.2292, simple_loss=0.306, pruned_loss=0.07622, over 21662.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3102, pruned_loss=0.0764, over 4276893.95 frames. ], batch size: 332, lr: 5.15e-03, grad_scale: 16.0
2023-06-22 15:51:54,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=986238.0, ans=0.05
2023-06-22 15:52:35,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=986358.0, ans=0.0
2023-06-22 15:52:58,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=986418.0, ans=0.125
2023-06-22 15:53:04,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0
2023-06-22 15:53:13,470 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.309e+02 2.613e+02 2.997e+02 4.619e+02, threshold=5.227e+02, percent-clipped=0.0
2023-06-22 15:53:41,841 INFO [train.py:996] (0/4) Epoch 6, batch 11950, loss[loss=0.2213, simple_loss=0.3209, pruned_loss=0.0608, over 21692.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3131, pruned_loss=0.07403, over 4271155.21 frames. ], batch size: 414, lr: 5.15e-03, grad_scale: 16.0
2023-06-22 15:54:52,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=986658.0, ans=0.0
2023-06-22 15:54:55,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0
2023-06-22 15:55:52,245 INFO [train.py:996] (0/4) Epoch 6, batch 12000, loss[loss=0.2165, simple_loss=0.277, pruned_loss=0.07794, over 21225.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3067, pruned_loss=0.07217, over 4273967.80 frames. ], batch size: 144, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 15:55:52,247 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-22 15:56:31,127 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.1261, 2.6594, 4.4203, 2.3961], device='cuda:0')
2023-06-22 15:56:35,825 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2631, simple_loss=0.3525, pruned_loss=0.08686, over 1796401.00 frames.
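During this validation pass the zipformer.py:1728 line dumps an attention-weights entropy tensor with four entries, consistent with one value per attention head (a reading assumed here, not confirmed by the log). The statistic itself is the Shannon entropy of each attention distribution, averaged over queries and batch; high values mean diffuse attention, low values mean peaky attention:

```python
import torch

def attn_weights_entropy(attn: torch.Tensor, eps: float = 1e-20) -> torch.Tensor:
    """attn: (num_heads, batch, query_len, key_len), rows summing to 1 over keys.
    Returns one mean entropy (in nats) per head; layout is an assumption."""
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # (heads, batch, query)
    return ent.mean(dim=(1, 2))
```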
2023-06-22 15:56:35,826 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23877MB
2023-06-22 15:56:47,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=986838.0, ans=0.125
2023-06-22 15:56:59,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=986838.0, ans=0.125
2023-06-22 15:57:21,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=986898.0, ans=0.125
2023-06-22 15:57:26,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=986958.0, ans=0.2
2023-06-22 15:58:18,399 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.546e+02 3.142e+02 3.620e+02 6.312e+02, threshold=6.283e+02, percent-clipped=4.0
2023-06-22 15:58:41,700 INFO [train.py:996] (0/4) Epoch 6, batch 12050, loss[loss=0.2153, simple_loss=0.2827, pruned_loss=0.07394, over 21863.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3032, pruned_loss=0.07374, over 4279713.25 frames. ], batch size: 351, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 15:59:30,861 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 15:59:36,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=987258.0, ans=0.125
2023-06-22 15:59:39,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=987258.0, ans=0.0
2023-06-22 16:00:32,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=987378.0, ans=0.125
2023-06-22 16:00:53,306 INFO [train.py:996] (0/4) Epoch 6, batch 12100, loss[loss=0.2417, simple_loss=0.318, pruned_loss=0.08265, over 21642.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3075, pruned_loss=0.0782, over 4285452.23 frames. ], batch size: 230, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 16:01:11,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=987438.0, ans=0.04949747468305833
2023-06-22 16:02:18,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=987558.0, ans=0.0
2023-06-22 16:02:27,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=987618.0, ans=0.125
2023-06-22 16:02:29,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=987618.0, ans=0.125
2023-06-22 16:03:14,373 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.726e+02 3.141e+02 3.562e+02 5.633e+02, threshold=6.281e+02, percent-clipped=0.0
2023-06-22 16:03:24,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=987678.0, ans=0.125
2023-06-22 16:03:33,318 INFO [train.py:996] (0/4) Epoch 6, batch 12150, loss[loss=0.2502, simple_loss=0.351, pruned_loss=0.07469, over 21295.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3098, pruned_loss=0.07689, over 4273460.78 frames. ], batch size: 548, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 16:04:56,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=987858.0, ans=0.125
2023-06-22 16:04:59,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=987918.0, ans=0.125
2023-06-22 16:05:01,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=987918.0, ans=0.0
2023-06-22 16:05:18,168 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0
2023-06-22 16:05:52,879 INFO [train.py:996] (0/4) Epoch 6, batch 12200, loss[loss=0.2079, simple_loss=0.2702, pruned_loss=0.07277, over 21491.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3047, pruned_loss=0.07552, over 4271982.98 frames. ], batch size: 391, lr: 5.15e-03, grad_scale: 32.0
2023-06-22 16:05:54,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=988038.0, ans=0.0
2023-06-22 16:06:05,412 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.27 vs. limit=15.0
2023-06-22 16:06:08,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=988098.0, ans=0.1
2023-06-22 16:06:11,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=988098.0, ans=0.2
2023-06-22 16:06:42,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=988158.0, ans=0.0
2023-06-22 16:07:15,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=988218.0, ans=0.125
2023-06-22 16:07:56,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 2.303e+02 2.808e+02 3.526e+02 6.344e+02, threshold=5.616e+02, percent-clipped=1.0
2023-06-22 16:08:03,854 INFO [train.py:996] (0/4) Epoch 6, batch 12250, loss[loss=0.1958, simple_loss=0.276, pruned_loss=0.05784, over 21489.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2979, pruned_loss=0.07207, over 4261399.93 frames. ], batch size: 471, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:08:05,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=988338.0, ans=10.0
2023-06-22 16:08:48,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=988458.0, ans=0.2
2023-06-22 16:09:00,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0
2023-06-22 16:10:11,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=988578.0, ans=0.125
2023-06-22 16:10:13,410 INFO [train.py:996] (0/4) Epoch 6, batch 12300, loss[loss=0.184, simple_loss=0.265, pruned_loss=0.05145, over 21216.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.291, pruned_loss=0.06708, over 4257666.53 frames. ], batch size: 176, lr: 5.14e-03, grad_scale: 8.0
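The scaling.py:182 lines record the current value (ans) of module parameters that are scheduled as a function of batch_count. A minimal sketch of piecewise-linear scheduling of that kind; the breakpoints here are invented for illustration, not the ones the recipe actually uses:

    import bisect

    def scheduled_value(batch_count, schedule=((0.0, 0.3), (20000.0, 0.125))):
        # Interpolate piecewise-linearly between (batch_count, value)
        # breakpoints; clamp to the end values outside the schedule's range.
        xs = [x for x, _ in schedule]
        ys = [y for _, y in schedule]
        if batch_count <= xs[0]:
            return ys[0]
        if batch_count >= xs[-1]:
            return ys[-1]
        i = bisect.bisect_right(xs, batch_count) - 1
        t = (batch_count - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])

    print(scheduled_value(987858.0))  # far past the last breakpoint -> 0.125

This late in training, most schedules have flattened out, which is why the same names keep logging identical ans values.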
2023-06-22 16:10:13,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=988638.0, ans=0.07
2023-06-22 16:10:16,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0
2023-06-22 16:11:37,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=988818.0, ans=0.95
2023-06-22 16:11:58,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=988818.0, ans=0.125
2023-06-22 16:12:12,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=988878.0, ans=0.125
2023-06-22 16:12:22,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=988878.0, ans=0.125
2023-06-22 16:12:28,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.976e+02 2.467e+02 3.000e+02 5.737e+02, threshold=4.934e+02, percent-clipped=1.0
2023-06-22 16:12:34,333 INFO [train.py:996] (0/4) Epoch 6, batch 12350, loss[loss=0.2285, simple_loss=0.3092, pruned_loss=0.0739, over 21834.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2966, pruned_loss=0.06792, over 4261191.91 frames. ], batch size: 118, lr: 5.14e-03, grad_scale: 8.0
2023-06-22 16:12:41,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=988938.0, ans=0.2
2023-06-22 16:13:03,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=988998.0, ans=0.125
2023-06-22 16:13:38,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=22.5
2023-06-22 16:13:45,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0
2023-06-22 16:13:47,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0
2023-06-22 16:14:15,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=989178.0, ans=0.0
2023-06-22 16:14:34,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=989178.0, ans=0.0
2023-06-22 16:14:36,938 INFO [train.py:996] (0/4) Epoch 6, batch 12400, loss[loss=0.2286, simple_loss=0.3043, pruned_loss=0.07644, over 21895.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2983, pruned_loss=0.07155, over 4271618.86 frames. ], batch size: 124, lr: 5.14e-03, grad_scale: 16.0
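The Whitening lines compare a per-module metric against a scheduled limit; the metric measures how far the channel covariance of a module's output is from a multiple of the identity, evaluating to 1.0 when the features are perfectly "white". A rough single-group sketch of such a metric (a simplification of what scaling.py computes, included only to make the logged numbers interpretable):

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        # x: (..., num_channels). Returns C * ||cov||_F^2 / trace(cov)^2,
        # which is 1.0 when the channel covariance is a multiple of the
        # identity and grows as the covariance becomes less white.
        x = x.reshape(-1, x.shape[-1]).float()
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]
        c = cov.shape[0]
        return (c * (cov ** 2).sum() / (torch.diagonal(cov).sum() ** 2 + 1e-20)).item()

    print(whitening_metric(torch.randn(1000, 256)))  # near 1 for white noise

Entries where the metric exceeds the limit (e.g. 12.53 vs. limit=22.5 would not, but the early-training values in this log often did) are the cases where the whitening penalty becomes active.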
2023-06-22 16:16:40,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=989478.0, ans=0.125
2023-06-22 16:16:45,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.525e+02 2.978e+02 3.587e+02 4.939e+02, threshold=5.956e+02, percent-clipped=1.0
2023-06-22 16:16:51,794 INFO [train.py:996] (0/4) Epoch 6, batch 12450, loss[loss=0.261, simple_loss=0.3257, pruned_loss=0.09813, over 21378.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3014, pruned_loss=0.07449, over 4272426.92 frames. ], batch size: 548, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:17:17,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.53 vs. limit=22.5
2023-06-22 16:17:53,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.77 vs. limit=6.0
2023-06-22 16:18:32,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=989658.0, ans=0.025
2023-06-22 16:18:40,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=989658.0, ans=0.07
2023-06-22 16:18:55,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=989718.0, ans=0.125
2023-06-22 16:19:14,378 INFO [train.py:996] (0/4) Epoch 6, batch 12500, loss[loss=0.2439, simple_loss=0.3317, pruned_loss=0.07807, over 21471.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.314, pruned_loss=0.07834, over 4279722.90 frames. ], batch size: 131, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:19:39,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=989838.0, ans=0.0
2023-06-22 16:19:45,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.71 vs. limit=15.0
2023-06-22 16:20:10,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=989898.0, ans=0.0
2023-06-22 16:20:56,551 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0
2023-06-22 16:21:38,558 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.650e+02 2.940e+02 3.345e+02 4.530e+02, threshold=5.879e+02, percent-clipped=0.0
2023-06-22 16:22:04,954 INFO [train.py:996] (0/4) Epoch 6, batch 12550, loss[loss=0.3135, simple_loss=0.3728, pruned_loss=0.1271, over 21323.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3189, pruned_loss=0.08064, over 4276266.93 frames. ], batch size: 507, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:22:32,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=990198.0, ans=0.2
2023-06-22 16:23:02,344 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=22.5
2023-06-22 16:23:31,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=990318.0, ans=0.1
2023-06-22 16:24:08,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=990378.0, ans=0.2
2023-06-22 16:24:14,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0
2023-06-22 16:24:22,784 INFO [train.py:996] (0/4) Epoch 6, batch 12600, loss[loss=0.1924, simple_loss=0.2877, pruned_loss=0.04855, over 21671.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3161, pruned_loss=0.07818, over 4279349.72 frames. ], batch size: 247, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:24:50,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=990498.0, ans=0.0
2023-06-22 16:25:28,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=990618.0, ans=0.125
2023-06-22 16:25:34,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=990618.0, ans=0.0
2023-06-22 16:26:21,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.398e+02 2.729e+02 3.427e+02 5.536e+02, threshold=5.458e+02, percent-clipped=0.0
2023-06-22 16:26:33,020 INFO [train.py:996] (0/4) Epoch 6, batch 12650, loss[loss=0.2382, simple_loss=0.3047, pruned_loss=0.08585, over 21879.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3087, pruned_loss=0.07446, over 4282251.55 frames. ], batch size: 124, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:27:15,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.72 vs. limit=10.0
2023-06-22 16:27:18,103 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0
2023-06-22 16:27:24,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=990858.0, ans=0.0
2023-06-22 16:27:26,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=990858.0, ans=0.125
2023-06-22 16:27:45,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=990918.0, ans=0.125
2023-06-22 16:27:54,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=990918.0, ans=0.125
2023-06-22 16:28:44,044 INFO [train.py:996] (0/4) Epoch 6, batch 12700, loss[loss=0.3012, simple_loss=0.348, pruned_loss=0.1272, over 21538.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3085, pruned_loss=0.07642, over 4287937.45 frames. ], batch size: 471, lr: 5.14e-03, grad_scale: 8.0
2023-06-22 16:29:17,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=991098.0, ans=0.125
2023-06-22 16:29:31,219 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0
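The grad_scale field in the batch lines tracks the fp16 loss-scaling factor: it halves after steps whose gradients overflow and grows back after a run of clean steps, which is why it moves between 32, 16, and 8 in this stretch (16 → 8 at batch 12700 above, back to 16 by batch 12800 below). The standard PyTorch pattern for this kind of dynamic scaling, shown here as a generic GradScaler loop rather than the training script's exact code:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=16.0)

    def train_step(model, optimizer, batch, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(batch))
        scaler.scale(loss).backward()  # backprop on the scaled loss
        scaler.step(optimizer)         # unscales grads; skips step on inf/nan
        scaler.update()                # halves or grows the scale factor
        return loss.item(), scaler.get_scale()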
2023-06-22 16:29:50,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=991158.0, ans=0.1
2023-06-22 16:30:47,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=991278.0, ans=0.0
2023-06-22 16:30:48,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.637e+02 2.986e+02 3.360e+02 4.638e+02, threshold=5.971e+02, percent-clipped=0.0
2023-06-22 16:31:05,011 INFO [train.py:996] (0/4) Epoch 6, batch 12750, loss[loss=0.2469, simple_loss=0.3182, pruned_loss=0.08776, over 19883.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3102, pruned_loss=0.07718, over 4277710.16 frames. ], batch size: 702, lr: 5.14e-03, grad_scale: 8.0
2023-06-22 16:31:16,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=991338.0, ans=0.125
2023-06-22 16:32:09,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0
2023-06-22 16:32:28,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=991518.0, ans=0.2
2023-06-22 16:33:12,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=991638.0, ans=0.125
2023-06-22 16:33:13,745 INFO [train.py:996] (0/4) Epoch 6, batch 12800, loss[loss=0.2723, simple_loss=0.343, pruned_loss=0.1008, over 21444.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3106, pruned_loss=0.07837, over 4286384.14 frames. ], batch size: 176, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:34:28,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.10 vs. limit=12.0
2023-06-22 16:34:32,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=991818.0, ans=0.0
2023-06-22 16:35:20,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.507e+02 2.757e+02 3.011e+02 5.577e+02, threshold=5.515e+02, percent-clipped=0.0
2023-06-22 16:35:22,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0
2023-06-22 16:35:25,516 INFO [train.py:996] (0/4) Epoch 6, batch 12850, loss[loss=0.2063, simple_loss=0.3013, pruned_loss=0.05567, over 21899.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3122, pruned_loss=0.07973, over 4288941.36 frames. ], batch size: 316, lr: 5.14e-03, grad_scale: 16.0
2023-06-22 16:35:26,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=991938.0, ans=0.2
2023-06-22 16:35:30,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=991938.0, ans=0.125
2023-06-22 16:35:31,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.54 vs. limit=10.0
2023-06-22 16:35:55,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=991998.0, ans=22.5
2023-06-22 16:36:28,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=992058.0, ans=0.2
2023-06-22 16:37:36,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=992178.0, ans=0.125
2023-06-22 16:37:40,705 INFO [train.py:996] (0/4) Epoch 6, batch 12900, loss[loss=0.1989, simple_loss=0.2836, pruned_loss=0.05711, over 21597.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3096, pruned_loss=0.07581, over 4282116.28 frames. ], batch size: 263, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:38:16,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=992298.0, ans=0.0
2023-06-22 16:38:21,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0
2023-06-22 16:38:25,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=992298.0, ans=0.0
2023-06-22 16:39:32,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0
2023-06-22 16:39:35,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=992418.0, ans=0.0
2023-06-22 16:39:40,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=992478.0, ans=0.0
2023-06-22 16:39:50,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.250e+02 2.553e+02 2.829e+02 4.968e+02, threshold=5.106e+02, percent-clipped=0.0
2023-06-22 16:39:54,898 INFO [train.py:996] (0/4) Epoch 6, batch 12950, loss[loss=0.1876, simple_loss=0.2752, pruned_loss=0.04996, over 21744.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3069, pruned_loss=0.0739, over 4276376.41 frames. ], batch size: 332, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:40:20,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=992538.0, ans=0.0
2023-06-22 16:40:29,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=992598.0, ans=0.2
2023-06-22 16:40:42,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=992598.0, ans=0.125
2023-06-22 16:40:45,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=992598.0, ans=0.125
2023-06-22 16:41:53,778 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 16:41:55,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=15.0
2023-06-22 16:41:58,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992778.0, ans=0.1
2023-06-22 16:41:58,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=22.5
2023-06-22 16:42:11,632 INFO [train.py:996] (0/4) Epoch 6, batch 13000, loss[loss=0.1925, simple_loss=0.2733, pruned_loss=0.05589, over 21684.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3058, pruned_loss=0.0737, over 4275587.77 frames. ], batch size: 263, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:42:39,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=992898.0, ans=0.035
2023-06-22 16:42:39,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0
2023-06-22 16:43:25,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992958.0, ans=0.1
2023-06-22 16:43:54,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=22.5
2023-06-22 16:44:17,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0
2023-06-22 16:44:19,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.566e+02 2.996e+02 3.463e+02 5.052e+02, threshold=5.993e+02, percent-clipped=0.0
2023-06-22 16:44:24,271 INFO [train.py:996] (0/4) Epoch 6, batch 13050, loss[loss=0.2691, simple_loss=0.3236, pruned_loss=0.1073, over 21639.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3017, pruned_loss=0.07172, over 4268342.37 frames. ], batch size: 471, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:45:23,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=993258.0, ans=0.0
2023-06-22 16:46:09,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=993318.0, ans=0.125
2023-06-22 16:46:38,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=993378.0, ans=0.0
2023-06-22 16:46:42,708 INFO [train.py:996] (0/4) Epoch 6, batch 13100, loss[loss=0.2815, simple_loss=0.3544, pruned_loss=0.1043, over 21175.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3042, pruned_loss=0.07126, over 4277825.29 frames. ], batch size: 143, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:47:22,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0
2023-06-22 16:47:30,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=993498.0, ans=0.125
2023-06-22 16:47:39,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=993498.0, ans=0.125
2023-06-22 16:48:18,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0
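The tot_loss fields average the per-batch losses over all the frames in the reporting window, which is why they move far more smoothly than the per-batch loss fields next to them. A minimal frame-weighted accumulator of that kind; the class is illustrative, not the training script's actual tracker:

    class RunningLoss:
        # Frame-weighted running average: heavier batches count for more,
        # matching the "over N frames" bookkeeping in the tot_loss fields.
        def __init__(self):
            self.weighted_sum = 0.0
            self.frames = 0.0

        def update(self, loss: float, num_frames: float) -> None:
            self.weighted_sum += loss * num_frames
            self.frames += num_frames

        @property
        def avg(self) -> float:
            return self.weighted_sum / max(self.frames, 1.0)

    tracker = RunningLoss()
    tracker.update(0.2815, 21175.0)  # the batch-13100 values logged above
    print(tracker.avg)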
2023-06-22 16:48:30,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=993618.0, ans=0.1
2023-06-22 16:48:30,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=993618.0, ans=0.125
2023-06-22 16:49:00,426 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.692e+02 3.327e+02 4.339e+02 6.233e+02, threshold=6.654e+02, percent-clipped=2.0
2023-06-22 16:49:18,893 INFO [train.py:996] (0/4) Epoch 6, batch 13150, loss[loss=0.3323, simple_loss=0.4561, pruned_loss=0.1043, over 19788.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3085, pruned_loss=0.07453, over 4275199.22 frames. ], batch size: 702, lr: 5.13e-03, grad_scale: 16.0
2023-06-22 16:50:37,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=993858.0, ans=0.0
2023-06-22 16:51:25,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=993978.0, ans=0.0
2023-06-22 16:51:34,353 INFO [train.py:996] (0/4) Epoch 6, batch 13200, loss[loss=0.2425, simple_loss=0.307, pruned_loss=0.08902, over 21289.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3069, pruned_loss=0.07542, over 4275370.08 frames. ], batch size: 176, lr: 5.13e-03, grad_scale: 32.0
2023-06-22 16:52:13,972 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0
2023-06-22 16:52:42,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=994158.0, ans=0.2
2023-06-22 16:52:45,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=994158.0, ans=0.125
2023-06-22 16:53:15,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0
2023-06-22 16:53:45,958 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.879e+02 3.252e+02 3.649e+02 5.858e+02, threshold=6.504e+02, percent-clipped=0.0
2023-06-22 16:53:56,116 INFO [train.py:996] (0/4) Epoch 6, batch 13250, loss[loss=0.2143, simple_loss=0.2981, pruned_loss=0.06526, over 21525.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3065, pruned_loss=0.07718, over 4276115.30 frames. ], batch size: 131, lr: 5.13e-03, grad_scale: 32.0
2023-06-22 16:53:56,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=994338.0, ans=0.09899494936611666
2023-06-22 16:54:04,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0
2023-06-22 16:54:40,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=994398.0, ans=0.05